
Chapter 1

Introduction

Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity
as well as its quality. Education is concerned with changes in human beings; a change is a
difference between two conditions; each of these conditions is known to us only by the products
produced by it—things made, words spoken, acts performed, and the like. To measure any of
these products means to define its amount in some way so that competent persons will know
how large it is, better than they would without measurement. To measure a product well means
so to define its amount that competent persons will know how large it is, with some precision,
and that this knowledge may be conveniently recorded and used. This is the general credo of
those who, in the last decade, have been busy trying to extend and improve measurements of
educational products (Thorndike, 1918, p 16).

Psychometrics is that area of psychology that specializes in how to measure what we talk
and think about. It is how to assign numbers to observations in a way that best allows us
to summarize our observations in order to advance our knowledge. Although in particular it
is the study of how to measure psychological constructs, the techniques of psychometrics are
applicable to most problems in measurement. The measurement of intelligence, extraversion,
severity of crimes, or even batting averages in baseball are all grist for the psychometric mill.
Any set of observations that are not perfect exemplars of the construct of interest is open to
questions of reliability and validity and to psychometric analysis.
Although it is possible to make the study of psychometrics seem dauntingly difficult, in fact
the basic concepts are straightforward. This text is an attempt to introduce the fundamental
concepts in psychometric theory so that the reader will be able to understand how to apply
them to real data sets of interest. It is not meant to make one an expert, but merely to instill
confidence and an understanding of the fundamentals of measurement so that the reader can
better understand and contribute to the research enterprise.
With the advent of powerful computer languages that have been developed with the specific
aim of doing statistics (R and S+), the process of doing psychometrics has become more
approachable. It is no longer necessary to write long programs in Fortran (Backus, 1998),
nor is it necessary to rely on sets of proprietary computer packages that have been developed
to do particular analyses. It is now possible to use R for almost all of one’s basic (and even
advanced) psychometric requirements.
R is an open source implementation of the computer language S. As such it is available free
of charge under the General Public License (GPL) of the GNU Project of the Free Software
Foundation.1 The source code of all R programs and packages is open to inspection and
change. The core of R has been developed over the past 20 years by a dedicated group (the
R Core Team) of about 15-20 members that includes some of the original authors of S.
In addition, there are at least 1000 packages that are written in R and contributed to the
overall R project by many different authors. In the psychometrics community, at least 10 to
20 packages have been developed and made available to the psychometrics user in particular
and the R community in general.
Like psychometrics, R is initially daunting. Also like psychometrics, while it takes years to
master, the basics can be learned fairly easily and expertise comes with practice. Moreover,
combining examples written and analyzed in R with psychometric problems allows one to
learn both at the same time. This text is thus both an introduction to psychometric theory
as well as to R .
The structure of this book is best represented in the form of a “structural model” showing
a number of boxes and circles with a set of connecting paths (Figure 1.1). This symbolic
notation is a way of showing the relationship between a set of observed variables (the boxes
on the far left and right of the figure) in terms of a smaller set of unobserved, or latent,
variables said to account for the observed variables (the circles in the middle of the figure).
Paths in the figure represent relationships. In the following chapters we will use this figure
to help locate where we are.

Part I: Basic issues

1.1 Constructs and measures (Chapter 2)

A basic distinction in science may be made between theoretical constructs and observed
measures thought to represent these constructs. This distinction is perhaps best understood
in terms of Plato’s Allegory of the Cave in the Republic (Book VII). Consider a group of
prisoners confined in the darkness of a cave. They are chained so that they face away from
the mouth of the cave and cannot observe anything behind them. Behind them there is a fire
and people are walking back and forth in front of the fire carrying a variety of objects. To
the prisoners, all that is observable are the shadows cast on the wall of the cave of the
walking people and of the objects that they carry. From the patterns of these shadows the
prisoners need to infer the reality of the people and objects. These shadows are analogous to
the observed variables that we study, while the “real” but unobservable people and objects are
the latent variables about which we make theoretical inferences. The prisoners make their
inferences based upon the patterning of the shadows.
While behaviorism reigned supreme in psychology until the late 1950’s, the emphasis
was upon measurement of observed variables. With a greater appreciation of the process of

1 Under the Free Software Foundation's GPL, all programs are required to have the following state-
ment. “This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details.” The GNU.org website has an ex-
tensive discussion of the meaning of “free” software.

[Figure 1.1 appears here: a path diagram relating the observed variables X1-X9 and Y1-Y6 (rectangles) to the latent variables Xi1-Xi3 and eta1-eta2 (ellipses).]
Fig. 1.1 A conceptual overview of the book. Psychometrics integrates the relationships between ob-
served variables (rectangles) in terms of latent or unobserved constructs (ellipses). Chapter 2 considers
what goes into an observable (e.g., the box X1) while Chapter 3 addresses the shape of the mapping
function between a latent variable and an observed variable (path Ξ1 to X1 ). Chapter 4 considers how
to assess relationships between two variables (e.g., the simple correlation or regression of X1 and Y1 ),
how to combine two or more variables to predict a third variable (multiple regression of Y1 on X1
and X2 ), and how to assess the relationship between two variables while holding the effect of a third
variable constant (partial correlation of X1 and Y1 with X2 partialed out). Chapter 7 considers how
to estimate the correlation between an observed variable and a latent variable (X1 with Ξ1 ) to esti-
mate the reliability of X1 . Chapter 6 addresses the question of how many latent variables are needed.
Chapter 9 considers how the structure of relationships allows alternative conceptualizations of validity.
Chapter 11 addresses how to model the entire figure.

theory building, the use of latent variables and hypothetical constructs gradually became
more accepted in psychology in general, and in psychometrics in particular. For rarely are
we interested in specific observations of twitches and utterances. Psychological theories are
concerned with higher level constructs such as love, intelligence, or happiness. Although
these constructs can not be measured directly, good theory relates them through a process
of measurement to specific observed behaviors and physiological markers.
A very deep question, and one that will not be addressed in the detail it requires, is what
does it mean to say we “measure” something? The assigning of numbers to observations
does not necessarily (and indeed probably does not) imply that the data are isomorphic to
the real numbers. The types of inferences that can be made from observations and the types
of analysis that are or are not appropriate for the observed data are deep questions that go
beyond the scope of an introductory text. Some of these issues are touched upon briefly in
terms of a typology of scales (3.6.1) and data transformations (3.6.1).

1.1.1 Observational and Experimental Psychology

A recurring debate in psychology ever since Wundt (1874) and Galton (1865, 1884) has been
the merits of experimental versus observational approaches to the study of psychology. These
two approaches tend to represent different sub fields within psychology and to emphasize
different types of training. Experimental psychology tends to emphasize central tendencies
and strives to formulate general laws of behavior and cognition. Observational psychologists,
on the other hand, emphasize variability and covariation and study individual differences in
ability, character, and temperament.
For many years these two approaches also used different statistical procedures to ana-
lyze data, with experimentalists comparing means using the t-test and its generalization, the
Analysis of Variance (ANOVA). This was in contrast to observationalists who would study
variability and particularly covariation with the correlation coefficient and multivariate pro-
cedures. However, with the recognition that ANOVA and correlations are just special cases
of the general linear model, the statistical distincition is less relevant than the distinction of
research methodologies.
Despite eloquent pleas for the reunification of these two disciplines by Cronbach (1957, 1975),
Eysenck (1966, 1997), and Vale and Vale (1969), there is surprisingly little emphasis upon
individual differences within experimental psychology (but see Underwood (1975) for why
individual differences are necessary for theory building in cognitive psychology) or upon
experimental techniques in personality research (Revelle and Oehleberg, 2008). However, both research approaches need
to understand the quality of measurements in order to make experimental or correlational
inferences.
With an emphasis upon the type of data we collect, and the problems of inference associ-
ated with the metric properties of our data, chapters 2 & 3 are particularly relevant for the
experimentalist who wants to interpret differences between observed means in terms of differ-
ences at a latent level. These two chapters are also important for the student of correlations
who wants to interpret scale score values as if they have more than ordinal meaning.

1.1.2 Data = Model + Residual

Perhaps the most important concept to realize in psychometrics in particular and statistics
in general is that we are modeling data, not merely reporting it. Complex data can be partly
understood in terms of simpler theories or models. But our models are incomplete and do not
completely capture the data. The data we collect, no matter how carefully we collect it or how
well we understand the process that generates it, are never quite what we expect. The
world is complex and the observations that we take are multiply determined. At the most
basic atomic level of physics, quantum randomness occurs. At the level of the neuron, coding
is a statistical frequency of firing, not a binary outcome. At the level of human behavior,
although our theories might be powerful (which they tend not to be), there are causes that
have not been observed.
Throughout the book we will propose models of the data and try to evaluate those models
in terms of how well they fit. Conceptually, we use the equations

Data = Model + Residual (1.1)

to represent the problem of inference. We evaluate how well our models fit by examining
the Residual, defined as
Residual = Data − Model (1.2)
and evaluate the magnitude of some function of the residual (Judd and McClelland, 1989).
These equations would seem to imply a greater quantitative level of measurement precision
than is generally the case, and should be treated as abstractions to remind us that we are
evaluating models of data and need to ask continually how appropriate the model is. The
distinction between model and residual is not new, for it dates at least to Plato. As discussed
by Stigler (1999) the whole of nineteenth century statistical theory was based upon this
distinction between physical truth as modeled by Newton and actual observations taken
to extend the theories. In statistical analysis, residual is typically treated as error, but by
treating our data in terms of what we know (model) and what we don’t know (residual) we
recognize that better models can explain some of the residual as well.
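To make the abstraction concrete, here is a minimal sketch in R; the variable names and the simple linear model are our own illustration, not an analysis from the text.

set.seed(17)
x <- 1:20
Data <- 3 + 0.5 * x + rnorm(20, sd = 2)   # observations from a noisy linear process
fit <- lm(Data ~ x)                        # a simple linear model of those observations
Model <- fitted(fit)
Residual <- Data - Model                   # Residual = Data - Model (Equation 1.2)
round(c(mean(Residual), mean(Residual^2)), 2)   # the mean residual is essentially zero; squared residuals are informative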

1.2 A theory of data (Chapter 2)

Clyde Coombs introduced a taxonomy of the kinds of data that we can observe that allows
us to abstractly organize what goes into each observation (Coombs, 1964). Although many
of the examples in this book are drawn from a small portion of the kinds of data that could
be collected, by thinking about the basic distinctions made by Coombs we see the range of
possibilities for psychometric applications. Coombs treated the problem of data in an abstract
form that puts psychometrics in the tradition of broader measurement theory.

1.3 Basic summary statistics – problems of scale (Chapter 3)

The problem of how to summarize data reflects some deep issues in measurement. The naive
assumption that our measures are linearly related to our constructs leads to many a misinter-
pretation. Indeed, to some, this assumption reflects a pathological thought disorder (Michell,
1997). Although we do not take such a strong position, examples of misinterpretation of findings
because of faulty assumptions of interval or ratio levels of measurement are easy to find (3.6).
Alternative ways of estimating central tendencies can lead to very different conclusions about
sets of data (3.5). Some of the issues raised in terms of the problem of scale will be addressed
again in the discussion of Item Response Theory (Chapter 8).

1.4 Covariance, regression, and correlation (Chapter 4)

Perhaps the most fundamental concept in psychometrics is the correlation coefficient. How
to best represent the relationship between two or more variables is a fundamental problem
that, if understood, may be generalized to most of psychometrics. Correlation takes many
forms and understanding when to use which type of correlation is important.

1.5 Multiple and partial correlation and regression (Chapter 5)

Generalizing the principles of correlation and regression to the case of multiple predictors
provides the basis of factor analysis, classic reliability theory as well as structural equation
modeling. Much of modern data analysis may be seen as special cases of the general linear
model relating one or many dependent variables to one or many predictor variables with
various assumptions about the scaling properties of the errors.

Part II: Latent variables

The concept of a latent variable is fundamental to psychometrics. The actual measurements
we take (our observed measures) are assumed to be probabilistic functions of some underlying
unobservable (latent) construct.

1.6 Factor, Principal Components and Cluster Analysis (Chapter 6)

How many latent constructs are represented in a data set of N variables? Is it possible to
reduce the complexity of the data without a great loss of information? How much informa-
tion is lost? Early in the development of a particular subarea, exploratory data reduction
techniques are most appropriate. These techniques range from cluster analysis and principal
components analysis to exploratory factor analysis. All of the procedures
are faced with the problem of what is the appropriate amount of data reduction and how to
evaluate alternative models. Confirmatory models, primarily confirmatory factor models, are
special cases of structural equation modeling procedures.

Part III: Classical and modern reliability theory

1.7 Classical theory and the Measurement of Reliability (Chapter 7)

How well does a scale measure whatever it is measuring? Do alternative measures give the
same or similar values? Are measures of the construct the same over time, over items, over
situations? These are the basic questions of both classical test theory as well as its general-
ization to Latent Trait and Item Response Theory.

Parallel tests and their generalizations

If one observed scale is thought to be composed of True score and Error, then the correlation
of the test with true score may be calculated in terms of the proportion of variance that is
True score. But how to estimate this? Parallel tests, tau equivalent tests, and congeneric test
models make progressively fewer assumptions about the data and all yield ways of estimating
true scores.

Domain sampling theory

An alternative approach, one that yields the same solution as congeneric test theory, is to think of
tests as representing larger and larger sets of items sampled from an infinite domain of items.
By thinking in terms of domain sampling, the meaning of alternative estimates of reliability
(α, β, ω) is easily understood. Hierarchical structures of tests in terms of group and general
factors emphasize the need to understand the test structure.

The many sources of reliability: Generalizability theory

Reliability needs to be considered in terms of the dimensions across which we want to gener-
alize our measures. A measure of a single trait should be a good indicator of a single domain
and should be consistent across time and situation. A measure of a mood, however, should
have high internal consistency (be a good measure of a single domain) but should not show
consistency over time. In a prediction setting, it is possible that a test need not have high
internal consistency, but it should be stable over time, situations, and perhaps forms. Relia-
bility is not just a concept of the item, but also is concerned with the source of the data. Do
multiple raters agree with each other? Do various forms of the test give similar answers?

1.8 Latent Trait Theory - The “New Psychometrics” (Chapter 8)

Although Classical Test Theory treats items as random replicates, it is possible to consider
item parameters as well. This leads to a more efficient means of estimating person parameters
and also emphasizes issues of scaling shape. Extensions to two and three parameter models,
ability and unfolding models, and dichotomous versus polytomous models will be considered.

1.9 Validity (Chapter 9)

Does a test measure what it is supposed to measure? How do we know? The most direct (and
perhaps least accurate) way is to simply examine the item content. Does the content appear
related to the construct of interest? Face validity (which is sometimes known as “faith” validity)
addresses the question of obvious relevance. Do questions about psychometric knowledge ask
about matrices or do they ask about general knowledge of English? The former item would
seem to be more valid than the latter.
If tests are used for selection or diagnosis, merely looking good is not enough. It is also
necessary to assess how well the measure correlates with current status of known criterion
groups or how well the test predicts future outcomes. Concurrent and predictive validity
assess how well tests correlate with alternative measures of the same construct right now and
how well they allow future predictions.
For theory testing and development, validity is a process more than a particular value. In
assessing the construct validity of a measure, it is necessary to examine the location of the
test in the complete nomological network of the theory. Assessing convergent validity asks
whether measures correlate with what the theory says they should correlate with. Equally
important is discriminant validity: do measures not correlate with what the theory says they
should not correlate with? A final part of construct validity is incremental validity: does
it make any difference if we add a test to a battery?

1.9.1 Decision Theory

The practical use of tests also involves knowing how to combine data to make decisions.
Although the measures used to predict and to validate tend to be continuous, decisions and
outcomes are frequently discrete. Students are admitted to graduate school or they are not.
They finish their Ph.D. or they do not. People are offered jobs, are promoted, are accused
of crimes, and are found guilty or innocent. These are binary decisions and binary outcomes
based upon linear and non-linear models of predictors. In addition to considering the base
rates of outcomes and the selectivity of the choice process, the utility of a test reflects the value
applied to the various types of outcomes as well as the cost of developing and giving the test.

1.10 Structural Equation Modeling (Chapter 11)

Structural Equation Modeling = Reliability + Validity: how to evaluate the measurement
model and the structural model at the same time. There are severe limitations on the type
of inferences that can be drawn, even from the best fitting structural equation. Through the
use of simulated data representing various threats to measurement, it is possible to better
understand how to properly interpret results of standard sem packages.

Part IV: The construction of tests and the analysis of data (Chapter 17)

Practical suggestions about how to construct scales based upon basic item statistics. For
students and practitioners with limited resources, some procedures are much more useful
than others. What are the tradeoffs involved in making particular decisions when developing
tests to use in research and applied settings? Knowing a few simple rules of test construction
and evaluation helps speed up the cycle of test development.

Appendices

1.10.1 Appendix – Basic R

The basic commands and methods for using R: how to get, install, and use the basic R
packages.

1.10.2 Appendix - Review of Matrix algebra

Although understanding matrix algebra is not completely necessary to understand psychometrics,
it makes it much easier. Because it is more abstract, matrix notation is far more
compact than the alternative notation using the summation of cross products. In addition,
programming operations in R using matrices produces much cleaner and faster code. This
appendix is what you should have learned in college but have probably forgotten.

1.11 General comments

This book is aimed at beginners in psychometrics (and perhaps in R) who want to use
the basic principles of psychometric theory in their substantive research. As an introduction
to psychometrics, some major philosophical issues about the meaning of measurement (e.g.,
Barrett (2005), Borsboom and Mellenbergh (2004), and Michell (1997)) will not be discussed
in the detail they deserve, nor will many of the basic models be derived from first principles in
the manner of Guilford (1954), McDonald (1999), or Nunnally (1967). It is hoped, however,
that the reader will become interested enough in the theory and practice of psychometrics
to delve into those much deeper texts.
Most scientists read books backwards. That is, we start at the later chapters and, if we
understand them, we are finished. If we don't, we go to an earlier chapter and test ourselves with
that. For that reason, the appendix on R is meant to allow the eager reader to start running
programs in R without reading anything else. However, the introductory chapters are meant
to be useful as they consider the meaning of our observations, the inferences we are able to
draw from observations and the inferences we can not make.
Chapter 2
A Theory of Data

At first glance, it would seem that there are an infinite number of ways to collect data. Measuring the
diameter of the earth by finding the distance to the horizon, measuring the height of waves
produced by a nuclear blast by nailing (empty) beer cans to a palm tree, or finding Avogadro’s
number by dropping oil into water are techniques that do not require great sophistication in
the theory of measurement. In psychology we can use self report, peer ratings, reaction times,
psychophysiological measures such as the electroencephalogram (EEG), the basal level of
Skin Conductance (SC), or the Galvanic Skin Response (GSR). We can measure the number
of voxels showing activation greater than some threshold in a functional Magnetic Resonance
Image (fMRI), or we can measure life time risk of cancer, length of life, risk of mortality,
etc. Indeed, the basic forms of data we can collect are probably unlimited. But in fact, it
is possible to organize these disparate forms of data abstractly in terms of what is being
measured and in comparison to what.

2.1 A Theory of Data: Objects, People, and Comparisons

Consider the following numbers and try to assign meaning to them: (2.718282), (3.1415929),
(24), (37), (98.7), (365.256363051), (3,413), (86,400), (31,557,600), (299,792,458), and (6.0221415 ±
0.0000010) × 10²³.¹ Because a number by itself is meaningless, all measures reflect a com-
parison between at least two elements. Without knowing the units of measure it is difficult
to recognize that 37 °C and 98.7 °F represent the same average body temperature, that 24
(hours) and 86,400 (seconds) both represent one day, and that 365.256 (days) is the same
length of time as 31,557,600 (seconds).
The comparison can be one of order (which is more) or one of proximity (are the numbers
the same or almost the same?). Given a set of Objects (O) and a set of People (P), Clyde
Coombs, in his Theory of Data, organized the types of measures one can take in a 3 x 2
x 2 table of the possible combinations of three distinct dimensions (Coombs, 1964). The
elements may be drawn from the set of People (P), the set of Objects (O), or the Cartesian
cross products of the sets of People (P x P), Objects (O x O), or People by Objects (P x O).

1 e, pi, hours in day, average body temperature in °C, average body temperature in °F, days in a
sidereal year, BTUs/KWH, seconds in a day, seconds in a year, speed of light in meters/sec, number
of atoms in a mole (Avogadro’s Number).


Furthermore, we can compare the results of these comparisons by forming either single dyads
or pairs of dyads.2
That is, if we let Oj refer to the jth object and Pi to the ith person, then we can compare
Oj to Ok, Pi to Pk, Pi to Oj, and so on.
1. The type of comparison made can either be one of order (is Pi < Pk?) or of distance (if
δ = |Pi − Pk|, then is δ < X?).
2. The elements being compared may be People, Objects, or People x Objects.
3. The number of dyads may be either one or more.

2.1.1 Modeling the comparison process

The two types of comparisons considered by Coombs were an order relationship and a prox-
imity relationship. Given three objects on a line, we say that A is greater than B if A − B
> 0. Similarly, B is greater than C if B − C > 0. Without error, if A, B, and C are on a line
and A > B and B > C, then A > C. With error we say that

p(A > B|A, B) = f (A − B). (2.1)

Alternatively, A is close to B if the absolute difference between them is less than some
threshold, δ:
p(|A − B| < δ|A, B, δ) = f(|A − B|, δ). (2.2)
This distinction may be seen graphically by considering the probability of being greater as a
function of the distance A -B (Fig 2.1) or the absolute difference between A and B.3
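The two comparison processes can be sketched directly in R. The logistic form, the sensitivity parameter, and the threshold used here are illustrative assumptions; Equations 2.1 and 2.2 only require some monotonic function f (the code that actually produces Figure 2.1 is in Appendix H).

# Order: the probability that A is reported as greater than B increases with
# the signed difference A - B (modeled here, as one possibility, with a logistic).
p.order <- function(A, B, sensitivity = 1) plogis(sensitivity * (A - B))

# Proximity: the probability that A and B are reported as "the same" increases
# as the absolute difference |A - B| falls below a threshold delta.
p.proximity <- function(A, B, delta = 1, sensitivity = 1)
  plogis(sensitivity * (delta - abs(A - B)))

p.order(c(-2, 0, 2), 0)       # 0.12 0.50 0.88
p.proximity(c(-2, 0, 2), 0)   # 0.27 0.73 0.27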
By using these three dimensions, it is possible to categorize the kind of data that we collect
(Table 2.1).

2.2 Models and model fitting

For all of the following examples of estimating scale values it is important to ask how well
the estimated scale values recreate the data from which they were derived. Good scale values
for the objects or for the people should provide a better fit to the data than do bad scale
values. That is, given a data matrix, D, with elements dij, we are trying to find model values,
mi and mj, such that some function, f, when applied to the model values, best recreates dij.
For data that are expressed as probabilities of an outcome, the model should provide a rule
for comparing multiple scale values that are not necessarily bounded 0-1 with output values
that are bounded 0-1. That is, we are interested in a mapping function f such that for any
values of mi and mj
0 ≤ f(mi, mj) ≤ 1 (2.3)

2 This taxonomy can be generalized if we consider a third component of measurement: when is the
measurement taken. We will consider the implications of a three dimensional organization in terms of
Cattell’s Databox (Cattell, 1966a) in chapter ??
3 The R-code for this and subsequent figures is included in Appendix-H

[Figure 2.1 appears here: two panels, Order and Proximity. In each the x-axis is A − B; the y-axis is the probability A > B (left) and the probability |A − B| < delta (right).]

Fig. 2.1 Left panel: The probability of observing A > B as a function of the difference between A
and B. The greater the signed difference, the greater the probability that A will be reported as greater
than B. The greater the signed difference, the greater the probability that A will be reported as greater
than B. The three lines represent three different amounts of sensitivity to distance. Right panel: the
probability of observing that A is the same as (close to) B as a function of the difference between A and B.
The less the absolute difference, the greater the probability they will be reported as the same.

Although there are any number of such functions, there are at least two conventional ones
that have such a property: one is the inverse normal transformation (where p values are
mapped into the corresponding z values of a cumulative normal distribution), the other is
the inverse logistic function (where p values are mapped onto the corresponding values of the
logistic function). Both of these mappings satisfy the requirements of Equation 2.3 for any
values of x and y.
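Both mappings are available directly in base R (qnorm/pnorm for the normal, qlogis/plogis for the logistic); a brief illustration:

p <- c(.05, .25, .50, .75, .95)
z <- qnorm(p)        # inverse normal: p values -> z values
pnorm(z)             # the cumulative normal maps any real value back into (0, 1)
logit <- qlogis(p)   # inverse logistic: p values -> logits, log(p/(1 - p))
plogis(logit)        # the logistic function maps any real value back into (0, 1)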
Remembering Equations 1.1 and 1.2, we need to find scale values that minimize some
function of the error. Applying f(mi ,m j ) for all values of i and j produces the model matrix
M. Let the error matrix E = D - M. Because average error will tend to be zero no matter how
badly the model fits, median absolute error or average squared error are typical estimates
of the amount of error. But such estimates are essentially “badness of fit” indices; goodness
of fit indices tend to be 1 − badness. Both goodness and badness estimates should somehow
reflect the size of the error with respect to the original data matrix. Thus, a generic estimate
of goodness of fit becomes

Table 2.1 The theory of data provides a 3 x 2 x 2 taxonomy for various types of measures

Elements of Dyad   Number of Dyads  Comparison  Name                      Chapter
People x People    1                Order       Tournament rankings       Theory of Data 2.4
People x People    1                Proximity   Social Networks           Theory of Data 2.5
Objects x Objects  1                Order       Scaling                   Thurstonian scaling 2.6.2
Objects x Objects  1                Proximity   Similarities              Multidimensional scaling 2.6.2
People x Objects   1                Order       Ability Measurement       Test Theory 7, 8.1
People x Objects   1                Proximity   Attitude Measurement      Attitudes 8.5.2
People x People    2                Order       Tournament rankings
People x People    2                Proximity   Social Networks           Theory of Data 2.5
Objects x Objects  2                Order       Scaling                   Theory of Data 2.7
Objects x Objects  2                Proximity   Multidimensional scaling, Theory of Data 2.7
                                                Individual Differences in MDS
People x Objects   2                Order       Ability Comparisons
People x Objects   2                Proximity   Preferential Choice       Unfolding Theory 2.8

GF = f (Data, Model) (2.4)


Variations on this generic goodness of fit estimate include Ordinary Least Squares Estimates
such as
GF = (Data − Model)²/Data² (2.5)
as well as measures of median absolute deviation from the median, or many variations on
Maximum Likelihood Estimates of χ².
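A minimal sketch of such indices in R; the function names here are our own, for illustration, and a fitting function is applied to real data in section 2.4.

# Badness is the squared error relative to the squared data (in the spirit of
# Equation 2.5); goodness is 1 - badness.
badness.of.fit <- function(Data, Model)
  sum((Data - Model)^2, na.rm = TRUE) / sum(Data^2, na.rm = TRUE)
goodness.of.fit <- function(Data, Model) 1 - badness.of.fit(Data, Model)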

2.3 A brief diversion: Functions in R

The examples in the rest of this (and subsequent) chapter(s) are created and analyzed using
small example snippets of R code. For the reader interested just in psychometrics, these
snippets can be ignored and the text, tables, and figures should suffice. However, reading the
brief pieces of code and trying to run them line by line or section by section will help the
reader learn how to use R. Even if you choose not to run the R-code while reading the text,
by reading the R, some familiarity with R syntax will be gained.
As discussed in much more detail in Appendix A, R is a function driven language. Almost
all operations invoke a function, usually by passing in some values, and then taking the
output of the function as an object to be used in a later function. All functions have the
form of function(parameter list). Some parameter lists are empty. Most functions have names
that are directly understandable, at least by context. To see how a function works, entering
the name of the function without the parentheses will usually show the function, although
some functions are invisible or hidden in namespaces. A list of all functions used in the text
is in Appendix B. Table B.1 briefly describes all the functions used in this chapter.
Programming in R can be done by creating new functions made up of other functions.
Packages are merely combinations of these new functions that are found to be useful. The
psych package is a collection of functions that have proven to be useful in doing psychometrics.
The functions within the package can be examined by entering the function name without
the parentheses. For simplicity of reading the text, if more than a few lines of code are needed
for the example, the R code will be included in appendix C rather than the text.
To obtain more information about any function, using the help function (?) provides a
definition of the function, the various options possible when calling it, and examples of how to
use the function. Studying the help pages will usually be enough to understand how to use the
function. Perhaps the biggest problem is remembering the amazing number of functions that
are available. Various web pages devoted to just listing the most used functions are common
(see, e.g., R/Rpad reference card at http://www.rpad.org/Rpad/Rpad-refcard.pdf).

2.4 Tournaments: Ordering people (pi > pj)

The most basic datum is probably comparing one person to another in terms of a direct order
relationship. This may be some sort of competition. Say we are interested in chess playing
skill and we have 16 people play everyone else in a series of round robin matches. This leads
to matrix of wins and losses. In Table 2.2, let a 1 in a cell mean that the row person beat the
column person. NAs are put on the diagonal since people do not play themselves. The data
were created by a simple simulation function (Appendix G) that assumed players differed in
their ability and then created the win loss record probabilistically using a logistic function.
Because A beating B implies B loses to A, elements below the diagonal of the matrix are just
1 - those above the diagonal.
Prob(win|Pi, Pj) = Prob(Pi > Pj|Pi, Pj) = 1/(1 + e^(Pj − Pi)) (2.6)
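The simulation function itself is given in Appendix G; a minimal sketch of the idea, with an arbitrary seed and our own variable names, is:

set.seed(42)                                        # an arbitrary seed, not necessarily the one used in the text
ability <- seq(-1.5, 1.5, length.out = 16)          # latent scores, as in Table 2.3
p.win <- plogis(outer(ability, ability, "-"))       # P(row beats column), Equation 2.6
tournament <- matrix(rbinom(16 * 16, 1, p.win), 16, 16)  # one simulated round robin
tournament[lower.tri(tournament)] <-                # a loss for one player is a win for the other
  1 - t(tournament)[lower.tri(tournament)]
diag(tournament) <- NA                              # players do not play themselves
colnames(tournament) <- rownames(tournament) <- paste("P", 1:16, sep = "")

Each cell above the diagonal is a single Bernoulli trial with the probability given by Equation 2.6, so a weaker player can, and occasionally does, beat a stronger one.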

Table 2.2 Simulated wins and losses for 16 chess players. Entries reflect row beating column.
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16
P1 NA 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
P2 1 NA 0 0 0 1 0 0 1 0 0 0 0 0 0 0
P3 0 1 NA 0 1 1 0 0 0 0 1 0 0 0 0 0
P4 1 1 1 NA 1 0 1 0 0 1 0 0 0 0 0 0
P5 1 1 0 0 NA 1 1 1 1 0 1 0 1 0 0 1
P6 1 0 0 1 0 NA 1 1 1 1 0 1 0 0 0 0
P7 1 1 1 0 0 0 NA 1 0 0 0 0 0 1 0 0
P8 0 1 1 1 0 0 0 NA 1 0 0 0 0 0 0 0
P9 1 0 1 1 0 0 1 0 NA 1 0 0 1 1 0 0
P10 1 1 1 0 1 0 1 1 0 NA 0 0 1 1 1 0
P11 1 1 0 1 0 1 1 1 1 1 NA 1 0 0 0 1
P12 1 1 1 1 1 0 1 1 1 1 0 NA 1 1 1 0
P13 1 1 1 1 0 1 1 1 0 0 1 0 NA 1 0 0
P14 1 1 1 1 1 1 0 1 0 0 1 0 0 NA 0 0
P15 1 1 1 1 1 1 1 1 1 0 1 0 1 1 NA 1
P16 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 NA

2.4.1 Scaling of People

There are multiple ways to score these results. One is the simple average score (found by
taking the rowMeans), a second is to convert the winning percentage to a normal score equivalent,
and the third is to convert the winning percentage to a logistic equivalent. It is easy to do all
three scoring procedures and to graphically compare them. This is done using the standard
data.frame function and the pairs.panels function from the psych package.

> score <- rowMeans(tournament, na.rm = TRUE)    # each player's winning percentage
> qscore <- qnorm(score)                         # normal score equivalent
> logit <- log(score/(1 - score))                # logistic (logit) equivalent
> chess.df <- data.frame(latent = p, observed = score, normed = qscore, logit)

Table 2.3 Three alternative solutions to the chess problem of Table 2.2
latent observed normed logit
P1 -1.5 0.13 -1.11 -1.87
P2 -1.3 0.20 -0.84 -1.39
P3 -1.1 0.27 -0.62 -1.01
P4 -0.9 0.40 -0.25 -0.41
P5 -0.7 0.60 0.25 0.41
P6 -0.5 0.47 -0.08 -0.13
P7 -0.3 0.33 -0.43 -0.69
P8 -0.1 0.27 -0.62 -1.01
P9 0.1 0.47 -0.08 -0.13
P10 0.3 0.60 0.25 0.41
P11 0.5 0.67 0.43 0.69
P12 0.7 0.80 0.84 1.39
P13 0.9 0.60 0.25 0.41
P14 1.1 0.53 0.08 0.13
P15 1.3 0.87 1.11 1.87
P16 1.5 0.80 0.84 1.39

Just assigning numbers is not enough, for it is important to evaluate how well the assigned
numbers capture the data. This requires a model of how to combine the rankings to predict
the outcome. The average percent wins would seem reasonable until we consider how to
combine them. A simple difference would not work, for that could lead to values outside of
the range. In analogy to the axioms of choice (Bradley and Terry, 1952; Luce, 1977), we
could predict that the probability of A beating B, p(A > B), is the ratio of the frequency of
A winning divided by the sum of the frequencies that A or B wins:

p(A > B|A, B) = p(A)/(p(A) + p(B)). (2.7)

For the normal deviate scores, a natural model would be to find the probability that A > B
by finding the cumulative normal value of the normal-score difference:

p(A > B|A, B) = pnorm(normalA − normalB ) (2.8)



> pairs.panels(chess.df)

[Figure 2.2 appears here: a SPLOM of chess.df. The latent scores correlate .84 with each of the three scoring systems (observed, normed, logit), which correlate 1.00 with each other.]

Fig. 2.2 Original model and three alternative scoring systems. This is also an example of a SPLOM
(scatter plot matrix) using the pairs.panels function from the psych package. A SPLOM shows the X-Y
scatter plots of each pair of variables. The x values in each plot reflect the column variable, the y values,
the row variable. The locally best fitting line is drawn as well as a error ellipse for the correlation. The
diagonal values are histograms with a density smooth. Numerical values of the Pearson corrlation are
shown above the diagonal.

For the logistic scores, the natural model would be to find the probability based upon the
difference of the logistic scores:
p(A > B|A, B) = 1/(1 + e^(logitB − logitA)) (2.9)
For the observed scores, any of these would probably make equally good sense. The function
scaling.fits can be used to find the goodness of fit for the choice (eq 2.7), normal (eq 2.8),
or logistic (eq 2.9) model. The function uses a list to hold multiple returned values. Note
that the choice model will not work for negative scale values, and thus when applying it to
normal or logistic modeled data, it is necessary to add in the minimum value.
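One way such a function might be written is sketched below. This is our own simplified version for illustration (it assumes scores is a vector of scale values, one per player, and data is the observed win/loss matrix); the scaling.fits function used in the text may differ in its details.

scaling.fits.sketch <- function(model = "logistic", scores, data) {
  if (model == "choice") scores <- scores - min(scores)  # the choice rule needs non-negative values
  predicted <- switch(model,
    choice   = outer(scores, scores, function(a, b) a / (a + b)),  # Equation 2.7
    normal   = pnorm(outer(scores, scores, "-")),                  # Equation 2.8
    logistic = plogis(outer(scores, scores, "-")))                 # Equation 2.9
  error <- data - predicted                                        # Residual = Data - Model
  list(model = model,
       GF = 1 - sum(error^2, na.rm = TRUE) / sum(data^2, na.rm = TRUE))
}
# e.g., scaling.fits.sketch("normal", chess.df$normed, tournament)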

Applying the scaling.fits function to the three scaling solutions found in chess.df yields
9 different estimates of goodness of fit (choice, logistic and normal models for each of the
three scoring systems of the basic data set). The output of any R function is an object that
can be used as input to any other function. We take advantage of this to make repeated calls
to the scaling.fits function and collect the output in a matrix (fits).
choice logistic normal
observed 0.64 0.58 0.61
normed 0.64 0.66 0.67
logistic 0.64 0.67 0.66
The goodness of fit tests suggest that both the normal and logistic procedures provide a
somewhat better fit to the data than does merely counting the percentage of wins.

2.4.2 Alternative approaches to scaling people

The previous example of a tournament forced all players to play each other. Although this
can be done with small groups, it is not feasible for larger groups. Tournament play for
large sets of players (or teams) needs to use alternative measurement models. The NCAA
basketball tournament of 65 teams is a well known alternative in which teams are eliminated
after losses. This allows a choice of an overall winner, but does not allow for precise rankings
below that. In addition, it is important to note in the simulated example (Table 2.2), that
a better player (P16) was defeated by a weaker player (P5), even though P16 had a much
higher winning percentage (80%) than did P5 (60%). In addition, in this 16 player match, the
observed rankings were correlated only .84 with the underlying ability (the latent score) used
to generate the data. Thus, neither a sudden death tournament nor a round robin tournament
necessarily leads to identifying the strongest overall player or team.
Using a system developed by Arpad Elo (and hence called the Elo scale), chess players
are ranked on a logistic scale where two players with equal scores have a 50% probability
of winning and a player rated 200 points higher than another would win 75% of the time. The
Elo system does not require all players to play each other, but rather adds and subtracts
points for each player following a match. Beating a better player adds more points than
does beating a weaker player, and losing to a player with a lower ranking subtracts more
points than losing to a player with a higher ranking. Revisions have been suggested by Mark
Glickman and others.
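A minimal sketch of one round of Elo-style updating follows; the 400 point logistic scale and the K factor of 32 are common conventions assumed here, not values taken from the text.

elo.update <- function(rating.A, rating.B, A.won, K = 32) {
  expected.A <- 1 / (1 + 10^((rating.B - rating.A) / 400))  # expected score for player A
  change <- K * (A.won - expected.A)                        # gain (or loss) relative to expectation
  c(A = rating.A + change, B = rating.B - change)
}
elo.update(1600, 1800, A.won = 1)   # the lower rated player wins and gains about 24 points
elo.update(1600, 1800, A.won = 0)   # the lower rated player loses and drops only about 8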
Logistic rating may be applied to the problem of rankings of colleges (Avery et al., 2004) as
well as sports teams. Using the pattern of student choice as analogous to direct competition
(School A beats B if student X chooses to attend A rather than B), and then scaling the
schools using the logistic model provides a better metric for college rankings than other
procedures. (As Avery and his colleagues point out, selectivity ratings can be “gamed” by
encouraging many weaker students to apply and then rejecting them, and yield ratings can
be inflated by rejecting students who are more likely to go somewhere else.)

2.4.3 Assigning numbers to people – the problem of rewarding merit

The previous examples considered that people are in direct competition with each other
and are evaluated on the same metric. This is not just an abstract problem for evaluating
chess players or basketball teams, but is a real life problem when it comes to assigning
salaries and raises. But it is not just a problem of evaluating merit, it is also a problem
in how to treat equal merit. Consider a department chairman with $100,000 of merit raises
to distribute to 20 faculty members who have an average salary of $100,000. That is, the
salary increment available is 5% of the total salary currently paid. Assume that the current
range of salaries reflects differences in career stage and that the range of current salaries
is uniformly distributed from $50,000 to $150,000. Furthermore, make the “Lake Wobegon”
assumption that all faculty members are above average and all deserve an “equal” raise.
What is the correct average? Is it $5,000 per faculty member ($100,000/20) or is 5% for each
faculty member (with the lecturers getting $2,500 and the full professors getting three times
as much, or $7,500)? This problem is considered in somewhat more detail when comparing
types of scales in section 3.16.
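The two allocation rules are easy to compare directly; a small sketch using the salaries assumed in the example:

salary <- seq(50000, 150000, length.out = 20)   # 20 salaries, uniform from $50,000 to $150,000
flat.raise <- rep(100000 / 20, 20)              # an equal dollar raise: $5,000 each
percent.raise <- 0.05 * salary                  # an equal percentage raise: 5% of current salary
range(percent.raise)                            # $2,500 for the lowest paid, $7,500 for the highest
c(flat = sum(flat.raise), percent = sum(percent.raise))  # both schemes spend the full $100,000 pool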

2.5 Social Networks: Proximities of People (|pi − pj| < δ)

An alternative to ordering people is to ask how close two people are to each other. This can
be done either for all possible pairings of people, or for a limited set of targets. In both cases,
the questions are the same: Are two people closer than some threshold, X: if δ = |Pi − Pj |,
then is δ < X? This very abstract representation allows us to consider how well known or
liked or desirable someone is, depending upon the way we phrase the question to person i:
1. Do you know person j?
2. Do you like person j? (or, as an alternative:)
3. Please list all your friends in this class (is person j included on the list?)
4. Would you be interested in having a date with person j?
5. Would you like to have sex with person j?
6. Would you marry person j?

2.5.1 Rectangular data arrays of similarity

If we ask a large group of people about a smaller set (perhaps of size one) of people, we
will form a rectangular array of proximities. For example, evolutionary psychologists have
used responses to items 4-6 asked by an attractive stranger (person j) to show strong sex
differences in interest in casual sex (Buss and Schmitt, 1993). Typical data might look like
Table 2.4. Sociologists might use questions 1-3 to examine social networks in classrooms.
Note that the data will not necessarily, and in fact probably will not, be symmetric. For
more people know (or know of) Barack Obama than he could possibly know.

Table 2.4 A hypothetical response matrix for questions 4-6 about social interaction with an attractive
stranger.

Person Gender Item 4 Item 5 Item 6


1 F 0 0 0
2 F 1 0 0
3 F 1 1 0
...
98 M 1 1 0
99 M 1 1 1
100 M 1 1 1

2.5.2 Square arrays of similarity

Another example of proximity data for pairs of people would be the results of “speed dating”
studies (Finkel et al., 2007). Given N people, each person spends a few minutes talking to
each other person in the room. After each brief conversation each person is asked whether
they want to see the other person again. (Abstractly, the assumption is that the smaller the
distance, δ, between two people, the more a person would want to see the other person; if
δ < X the person responds yes.) Here, although the matrix is square (everyone is compared
with everyone else), the proximities are not symmetric, for some people are liked by (are close to)
more people than others.
We simulate the data using an “experimental design” in which each of 10 males interacts for
3 minutes with each of 10 females, and vice versa. After each interaction both members
of the pair were asked whether they wanted to see the other person again. To simulate an
example of such data we create a 20 x 20 array of person “interest” by randomly sampling
with replacement from the numbers 0 and 1. Let the rows represent our participants and the
columns the expression of interest they have in the other participants. The first 10 participants
are females, the second 10 males (Table 2.5).
> set.seed(42)
> prox <- matrix(rep(NA, 400), ncol = 20)
> prox[11:20, 1:10] <- matrix(sample(2, 100, replace = TRUE) - 1)
> prox[1:10, 11:20] <- matrix(sample(2, 100, replace = TRUE) - 1)
> colnames(prox) <- rownames(prox) <- c(paste("F", 1:10, sep = ""),
+ paste("M", 1:10, sep = ""))
> prox

Because of the experimental design, the data matrix has missing values for same sex pairs.
We find the row and column means, specifying that missing values should not be included.
> colMeans(prox, na.rm = TRUE)

F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0.8 0.5 0.6 0.6 0.6 0.4 0.7 0.2 0.4 0.7 0.6 0.7 0.6 0.9 0.4 0.5 0.4 0.5 0.6 0.4
> rowMeans(prox, na.rm = TRUE)

Table 2.5 Hypothetical results from a speed dating study.


F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
F1 NA NA NA NA NA NA NA NA NA NA 1 1 0 1 0 1 1 1 1 0
F2 NA NA NA NA NA NA NA NA NA NA 0 1 0 1 0 0 1 0 1 1
F3 NA NA NA NA NA NA NA NA NA NA 0 1 1 1 0 1 1 0 0 0
F4 NA NA NA NA NA NA NA NA NA NA 0 0 1 1 1 0 0 0 0 0
F5 NA NA NA NA NA NA NA NA NA NA 1 1 1 1 1 1 1 0 1 0
F6 NA NA NA NA NA NA NA NA NA NA 1 1 0 1 1 1 0 0 1 0
F7 NA NA NA NA NA NA NA NA NA NA 1 0 1 1 0 0 0 1 1 1
F8 NA NA NA NA NA NA NA NA NA NA 1 0 1 0 1 0 0 1 1 0
F9 NA NA NA NA NA NA NA NA NA NA 1 1 0 1 0 0 0 1 0 1
F10 NA NA NA NA NA NA NA NA NA NA 0 1 1 1 0 1 0 1 0 1

M1 1 0 1 1 0 0 1 0 1 1 NA NA NA NA NA NA NA NA NA NA
M2 1 1 0 1 0 0 1 0 0 0 NA NA NA NA NA NA NA NA NA NA
M3 0 1 1 0 0 0 1 0 0 0 NA NA NA NA NA NA NA NA NA NA
M4 1 0 1 1 1 1 1 0 1 1 NA NA NA NA NA NA NA NA NA NA
M5 1 0 0 0 0 0 1 0 1 1 NA NA NA NA NA NA NA NA NA NA
M6 1 1 1 1 1 1 0 1 1 1 NA NA NA NA NA NA NA NA NA NA
M7 1 1 0 0 1 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA
M8 0 0 1 0 1 0 1 0 0 1 NA NA NA NA NA NA NA NA NA NA
M9 1 0 0 1 1 0 1 1 0 1 NA NA NA NA NA NA NA NA NA NA
M10 1 1 1 1 1 1 0 0 0 1 NA NA NA NA NA NA NA NA NA NA

F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
0.7 0.5 0.5 0.3 0.8 0.6 0.6 0.5 0.5 0.6 0.6 0.4 0.3 0.8 0.4 0.9 0.4 0.4 0.6 0.7

Given the data matrix, prox, the column means represent how much each person was liked
(participants F8 and M3 seem to be the least popular, and participants F1 and M4 the
most popular). The row means represent differences in how much people liked others, with
participants F5 and M6 liking the most people and participants F4 and M3 the fewest.
What is not known from the simple assignment of average numbers is whether this
is the appropriate metric. That is, is the difference between having 50% of the people like
you versus 60% the same as the difference between 80% and 90%?
The Social Network Analysis (sna) package in R allows for a variety of statistical and
graphical analyses of similarity matrices. One such analysis is the comparison of different
networks. Another is the ability to graph social networks (Figure 2.3).

2.6 The Scaling of Objects (oi < oj)

When judging the value of a particular object on a specific dimension, we can form scales
based upon the effect the object has on similar objects. This technique has long been used
in the physical sciences. For example, Friedrich Mohs’ (1773-1839) scale of hardness is an
ordinal scale based upon direct comparisons of minerals. If one mineral will scratch another,
we say the first is harder than the second. Hardness is transitive, in that if X scratches Y
and Y scratches Z, then X will scratch Z (Table 2.6). The Mohs’ hardness scale is an ordinal
scale and does not reflect that the differences between the 10 minerals can be measured more
precisely in terms of the amount of force required to make the scratch using a diamond tip

[Figure 2.3 appears here: a network graph of the 20 speed dating participants, F1-F10 and M1-M10.]

Fig. 2.3 Social networks analysis of the data from Table 2.5

and a sclerometer. Even more precise measures may be made by measuring the size of an
indentation made by a diamond tip under various levels of pressure (Burchard, 2004). When
this is done the Mohs scale can be converted from an ordinal scale to a relative hardness scale.
Note that a difference of 1 on the Mohs scale corresponds to a difference in hardness ranging
from 3% to a factor of 89!

Table 2.6 Mohs’ scale of mineral hardness. An object is said to be harder than X if it scratches X.
Also included are measures of relative hardness using a sclerometer (for the hardest of the planes if
there is an anisotropy or variation between the planes) which shows the non-linearity of the Mohs scale
(Burchard, 2004).

Mohs Hardness Mineral Scratch hardness


1 Talc .59
2 Gypsum .61
3 Calcite 3.44
4 Fluorite 3.05
5 Apatite 5.2
6 Orthoclase Feldspar 37.2
7 Quartz 100
8 Topaz 121
9 Corundum 949
10 Diamond 85,300

Another way of ordering objects is in terms of their effects upon the external environment.
For sailors, it is important to be able to judge wind conditions by observation. Although the
effect of wind varies as the square of the velocity, a roughly linear metric of wind speed based
upon observed sea state was developed by Sir Francis Beaufort (1774-1857) and is still in
common use among windsurfers and sailors. Beaufort’s original scale was in terms of how a
British frigate would handle and what sails she could carry but was later revised in terms of
observations of sea state. Beaufort did not classify wind speed in terms of velocity; these
estimated equivalents as well as the current descriptions have been added by meteorologists
(Table 2.7). What is most important to notice is that because of the non-linear (squared)
effect of wind velocity on sailors, equal changes in the Beaufort scale (e.g., from 1 to 2 or
from 4 to 5) do not lead to equal changes in such important outcomes as the probability of
capsizing!

Table 2.7 The Beaufort scale of wind intensity is an early example of a scale with roughly equal units
that is observationally based. Although the units are roughly in equal steps of wind speed in nautical
miles/hour (knots), the force of the wind is not linear with this scale, but rather varies as the square
of the velocity.

Force Wind (Knots) WMO Classification Appearance of Wind Effects


0 Less than 1 Calm Sea surface smooth and mirror-like
1 1-3 Light Air Scaly ripples, no foam crests
2 4-6 Light Breeze Small wavelets, crests glassy, no breaking
3 7-10 Gentle Breeze Large wavelets, crests begin to break, scattered whitecaps
4 11-16 Moderate Breeze Small waves 1-4 ft. becoming longer, numerous whitecaps
5 17-21 Fresh Breeze Moderate waves 4-8 ft taking longer form, many whitecaps,
some spray
6 22-27 Strong Breeze Larger waves 8-13 ft, whitecaps common more spray
7 28-33 Near Gale Sea heaps up, waves 13-20 ft, white foam streaks off breakers
8 34-40 Gale Moderately high (13-20 ft) waves of greater length, edges of crests begin
to break into spindrift, foam blown in streaks
9 41-47 Strong Gale High waves (20 ft), sea begins to roll, dense streaks of foam,
spray may reduce visibility
10 48-55 Storm Very high waves (20-30 ft) with overhanging crests, sea white
with densely blown foam, heavy rolling, lowered visibility
11 56-63 Violent Storm Exceptionally high (30-45 ft) waves, foam patches cover sea,
visibility more reduced
12 64+ Hurricane Air filled with foam, waves over 45 ft, sea completely white
with driving spray, visibility greatly reduced

2.6.1 Weber-Fechner scales of subjective experience

Early studies of psychophysics by Weber (1834b,a) and subsequently Fechner (1860) demon-
strated that the human perceptual system does not perceive stimulus intensity as a linear
function of the physical input. The basic paradigm was to compare one weight with another
that differed by amount ∆ , e.g., compare a 10 gram weight with an 11, 12, and 13 gram weight,
or a 10 kg weight with an 11, 12, or 13 kg weight. What was the ∆ that was just detectable?

The finding was that the perceived intensity follows a logarithmic function. Examining the
magnitude of the “just noticeable difference” or JND, Weber (1834b) found that

JND = ∆Intensity/Intensity = constant. (2.10)
An example of a logarithmic scale of intensity is the decibel measure of sound intensity.
Sound Pressure Level expressed in decibels (dB) of the root mean square observed sound
pressure, Po (in Pascals), is

Lp = 20 log10(Po/Pref) (2.11)

where the reference pressure, Pref, in air is 20 µPa. Just to make this confusing, the reference
pressure for sound measured in the ocean is 1 µPa. This means that sound intensities in the
ocean are expressed in units that are about 26 dB (20 log10(20)) higher than those used on land.
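A small sketch of Equation 2.11 in R makes the logarithmic character of the scale, and the effect of the reference pressure, explicit:

spl <- function(p, p.ref = 20e-6) 20 * log10(p / p.ref)  # p and p.ref in Pascals; 20 µPa is the air reference
spl(c(2e-4, 2e-3, 2e-2, 0.2))     # 20, 40, 60, 80 dB: equal pressure ratios give equal dB steps
spl(0.2, p.ref = 1e-6) - spl(0.2) # the ocean reference (1 µPa) reads about 26 dB higher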
Although typically thought of as just relevant for the perceptual experiences of physical
stimuli, Ozer (1993) suggested that the JND is useful in personality assessment as a way of
understanding the accuracy and inter judge agreement of judgments about other people. In
addition, Sinn (2003) has argued that the logarithmic nature of the Weber-Fechner Law is of
evolutionary significance for preference for risk and cites Bernoulli (1738) as suggesting that
our general utility function is logarithmic.
... the utility resulting from any small increase in wealth will be inversely proportionate to the
quantity of goods already possessed .... if ... one has a fortune worth a hundred thousand ducats
and another one a fortune worth the same number of semi-ducats and if the former receives from it a
yearly income of five thousand ducats while the latter obtains the same number of semi-ducats,
it is quite clear that to the former a ducat has exactly the same significance as a semi-ducat to
the latter (Bernoulli, 1738, p 25).

2.6.2 Thurstonian Scaling

Louis L. Thurstone was a pioneer in psychometric theory and the measurement of attitudes,
interests, and abilities. Among his many contributions was a systematic analysis of the process
of comparative judgment (Thurstone, 1927). He considered the case of asking subjects to
successively compare pairs of objects. If the same subject does this repeatedly, or if subjects
act as random replicates of each other, their judgments can be thought of as sampled from a
normal distribution of underlying (latent) scale scores for each object. Thurstone proposed
that the comparison between the values of two objects could be represented as the difference
of the average value for each object relative to the standard deviation of
the differences between objects. The basic model is that each item has a normal distribution
of response strength and that choice represents the stronger of the two response strengths
(Figure 2.4). A justification for the normality assumption is that each decision represents the
sum of many independent inputs and thus, through the central limit theorem, is normally
distributed.
Thurstone considered five different sets of assumptions about the equality and independence
of the variances for each item (Thurstone, 1927). Torgerson expanded this analysis
slightly by considering three classes of data collection (within individuals, between individuals,
and mixes of within and between) crossed with three sets of assumptions (equal covariance

of decision process, equal correlations and small differences in variance, equal variances)
(Torgerson, 1958).

[Figure 2.4 here: the left panel plots response strength and the right panel plots probability of choice, each as a function of the latent scale value.]

Fig. 2.4 Thurstone’s model of paired discrimination. Left panel: three items differ in their mean level
as well as their variance. Right panel: choice between two items with equal variance reflects the relative
strength of the two items. The shaded section represents choosing item 2 over item 1.

Thurstone scaling has been used in a variety of contexts, from scaling the severity of crimes
(Coombs, 1964) to the severity of cancer symptoms to help nurses understand their patients
(Degner et al., 1998) or in market research to scale alternative products (Green et al., 1989).
Consider an example of scaling of vegetables, discussed in great detail in Guilford (1954).
Participants were asked whether they preferred vegetable A to vegetable B for a set of nine
vegetables (Turnips, Cabbage, Beets, Asparagus, Carrots, Spinach, String Beans, Peas, and
Corn). This produced the data matrix in Table 2.8, where the numbers represent the proportion
of the time that the column vegetable was chosen over the row vegetable (Guilford, 1954).
Just as when scaling individuals (section 2.4), there are several natural ways to convert
these data into scale values. The easiest is simply to find the average preference for each
item. To do this, copy the data table into the clipboard and then, using the read.clipboard
function (part of the psych package), create the new variable, veg. The read.clipboard
function will create a data.frame, veg. Although a data.frame looks like a matrix, each
column can be treated individually and may contain a different type of data
(e.g., character, logical, or numeric). The mean function will report separate means for each
variable in a data frame whereas it will report just the overall mean for a matrix. Were veg
a matrix, mean would report just one value but colMeans would report the mean for each
column.
> data(vegetables) #includes the veg data set
> round(mean(veg),2)

Table 2.8 Paired comparisons of nine vegetables (from Guilford, 1954). The numbers represent the
probability with which the column vegetable is chosen over the row vegetable. Data available in psych
as vegetables

Turn Cab Beet Asp Car Spin S.Beans Peas Corn


Turn 0.500 0.818 0.770 0.811 0.878 0.892 0.899 0.892 0.926
Cab 0.182 0.500 0.601 0.723 0.743 0.736 0.811 0.845 0.858
Beet 0.230 0.399 0.500 0.561 0.736 0.676 0.845 0.797 0.818
Asp 0.189 0.277 0.439 0.500 0.561 0.588 0.676 0.601 0.730
Car 0.122 0.257 0.264 0.439 0.500 0.493 0.574 0.709 0.764
Spin 0.108 0.264 0.324 0.412 0.507 0.500 0.628 0.682 0.628
S.Beans 0.101 0.189 0.155 0.324 0.426 0.372 0.500 0.527 0.642
Peas 0.108 0.155 0.203 0.399 0.291 0.318 0.473 0.500 0.628
Corn 0.074 0.142 0.182 0.270 0.236 0.372 0.358 0.372 0.500

Turn Cab Beet Asp Car Spin S.Beans Peas Corn


0.18 0.33 0.38 0.49 0.54 0.55 0.64 0.66 0.72

Given that the zero point is arbitrary, subtracting the Turnip value from other vegetables
does not change anything:
> veg.t <- mean(veg) - mean(veg[,1])
> round(veg.t,2)
Turn Cab Beet Asp Car Spin S.Beans Peas Corn
0.00 0.15 0.20 0.31 0.36 0.37 0.46 0.48 0.54

If these values were useful, then it should be possible to recreate the rows of the original
matrix (Table 2.8) by taking the differences between any two scale values + .5 (since a
vegetable is preferred over itself 50% of the time, .5 is added to the predicted choice). But this
produces values greater than 1 and less than 0! The predicted probability of choosing Corn
over Turnips would be .5 + .54 = 1.04. Clearly, this is not a good solution.
Thurstone’s proposed solution was to assign scale values based upon the average normal
deviate transformation of the raw probabilities. This is found by first converting all the
observed probabilities into their corresponding standard normal values, z-scores, and then
finding the average z-score. This is done using the qnorm, as.matrix, and colMeans functions.
(Table 2.9).
Finding the average z-score for each column of Table 2.9 is equivalent to finding the
least squares solution to the series of equations expressing the pairwise distances (Torgerson,
1958). Adding a constant does not change the distances, so by subtracting the smallest value
from all the numbers we find the following scale values:
But how well does this set of scale values fit? One way to evaluate the fit is to find the
predicted paired comparisons given the model and then subtract them from the observed.
The model matrix is found by taking the differences of the row and column values for the
items. For example, the modeled value for String Beans vs. Corn is 1.40 - 1.63 or -.23.
Then, convert these modeled values to probabilities (Table 2.11) and then find the residuals
or errors (Table 2.12) by comparing to the original data (Table 2.8).
> modeled <- pnorm(pdif)
> round(modeled,2)

Table 2.9 Converting the paired comparison data of Table 2.8 into z-scores by using the qnorm function

> z.veg <- qnorm(as.matrix(veg))


> round(z.veg,2) #see table
> scaled.veg <- colMeans(z.veg)
> scaled <- scaled.veg - min(scaled.veg)
> round(scaled,2)
Turn Cab Beet Asp Car Spin S.Beans Peas Corn
0.00 0.52 0.65 0.98 1.12 1.14 1.40 1.44 1.63

Turn Cab Beet Asp Car Spin S.Beans Peas Corn


Turn 0.00 0.91 0.74 0.88 1.17 1.24 1.28 1.24 1.45
Cab -0.91 0.00 0.26 0.59 0.65 0.63 0.88 1.02 1.07
Beet -0.74 -0.26 0.00 0.15 0.63 0.46 1.02 0.83 0.91
Asp -0.88 -0.59 -0.15 0.00 0.15 0.22 0.46 0.26 0.61
Car -1.17 -0.65 -0.63 -0.15 0.00 -0.02 0.19 0.55 0.72
Spin -1.24 -0.63 -0.46 -0.22 0.02 0.00 0.33 0.47 0.33
S.Beans -1.28 -0.88 -1.02 -0.46 -0.19 -0.33 0.00 0.07 0.36
Peas -1.24 -1.02 -0.83 -0.26 -0.55 -0.47 -0.07 0.00 0.33
Corn -1.45 -1.07 -0.91 -0.61 -0.72 -0.33 -0.36 -0.33 0.00

Table 2.10 All models are approximations to the data, and an analysis of residuals is essential to
evaluating the goodness of fit of the model. Here the modeled z-score differences are formed from the
pairwise differences of the scale values; the residuals (Table 2.12) compare the data to these modeled values.

> pdif <- - scaled %+% t(scaled)


> colnames(pdif) <- rownames(pdif) <- colnames(z.veg)
> round(pdif,2)
Turn Cab Beet Asp Car Spin S.Beans Peas Corn
Turn 0.00 0.52 0.65 0.98 1.12 1.14 1.40 1.44 1.63
Cab -0.52 0.00 0.13 0.46 0.60 0.62 0.88 0.92 1.11
Beet -0.65 -0.13 0.00 0.33 0.46 0.49 0.75 0.79 0.98
Asp -0.98 -0.46 -0.33 0.00 0.14 0.16 0.42 0.46 0.65
Car -1.12 -0.60 -0.46 -0.14 0.00 0.03 0.28 0.33 0.51
Spin -1.14 -0.62 -0.49 -0.16 -0.03 0.00 0.26 0.30 0.49
S.Beans -1.40 -0.88 -0.75 -0.42 -0.28 -0.26 0.00 0.04 0.23
Peas -1.44 -0.92 -0.79 -0.46 -0.33 -0.30 -0.04 0.00 0.19
Corn -1.63 -1.11 -0.98 -0.65 -0.51 -0.49 -0.23 -0.19 0.00

> resid <- veg - modeled


> round(resid,2)
These residuals seem small. But how small is small? The mean residual is (as it should
be) 0. Describing the residuals (using the describe function) suggests that they are indeed
small (Table 2.13).
Alternatively, using a goodness of fit test (e.g., Equation 2.5), the sum of squared residuals
(.31) is much less than the sum of squared data values (24.86), for a goodness of fit of .994.

Table 2.11 Modeled probability of choice based upon the modeled scale values

Turn Cab Beet Asp Car Spin S.Beans Peas Corn


Turn 0.50 0.70 0.74 0.84 0.87 0.87 0.92 0.93 0.95
Cab 0.30 0.50 0.55 0.68 0.72 0.73 0.81 0.82 0.87
Beet 0.26 0.45 0.50 0.63 0.68 0.69 0.77 0.79 0.84
Asp 0.16 0.32 0.37 0.50 0.55 0.57 0.66 0.68 0.74
Car 0.13 0.28 0.32 0.45 0.50 0.51 0.61 0.63 0.70
Spin 0.13 0.27 0.31 0.43 0.49 0.50 0.60 0.62 0.69
S.Beans 0.08 0.19 0.23 0.34 0.39 0.40 0.50 0.52 0.59
Peas 0.07 0.18 0.21 0.32 0.37 0.38 0.48 0.50 0.57
Corn 0.05 0.13 0.16 0.26 0.30 0.31 0.41 0.43 0.50

Table 2.12 Residuals = data - model

Turn Cab Beet Asp Car Spin S.Beans Peas Corn


Turn 0.00 0.12 0.03 -0.03 0.01 0.02 -0.02 -0.03 -0.02
Cab -0.12 0.00 0.05 0.05 0.02 0.00 0.00 0.02 -0.01
Beet -0.03 -0.05 0.00 -0.07 0.06 -0.01 0.07 0.01 -0.02
Asp 0.03 -0.05 0.07 0.00 0.01 0.02 0.01 -0.08 -0.01
Car -0.01 -0.02 -0.06 -0.01 0.00 -0.02 -0.04 0.08 0.07
Spin -0.02 0.00 0.01 -0.02 0.02 0.00 0.03 0.06 -0.06
S.Beans 0.02 0.00 -0.07 -0.01 0.04 -0.03 0.00 0.01 0.05
Peas 0.03 -0.02 -0.01 0.08 -0.08 -0.06 -0.01 0.00 0.05
Corn 0.02 0.01 0.02 0.01 -0.07 0.06 -0.05 -0.05 0.00

Table 2.13 Basic summary statistics of the residuals suggest that they are very small
> describe(resid)
var n mean sd median mad min max range se
Turn 1 9 -0.01 0.05 0.00 0.03 -0.12 0.03 0.15 0.02
Cab 2 9 0.00 0.05 0.00 0.02 -0.05 0.12 0.17 0.02
Beet 3 9 0.00 0.05 0.01 0.04 -0.07 0.07 0.14 0.02
Asp 4 9 0.00 0.04 -0.01 0.03 -0.07 0.08 0.14 0.01
Car 5 9 0.00 0.05 0.01 0.01 -0.08 0.06 0.14 0.02
Spin 6 9 0.00 0.03 0.00 0.03 -0.06 0.06 0.12 0.01
S.Beans 7 9 0.00 0.04 0.00 0.03 -0.05 0.07 0.12 0.01
Peas 8 9 0.00 0.05 0.01 0.06 -0.08 0.08 0.16 0.02
Corn 9 9 0.01 0.04 -0.01 0.02 -0.06 0.07 0.13 0.01

2.6.3 Alternative solutions to the ranking of objects

Just as in the scaling of people in tournaments there were alternative ways to assign scale
values and to evaluate the scale values, so we can consider alternatives to the Thurstone
scale. Although we cannot take simple differences between scale values to predict choice,
using Equation 2.7 or Equation 2.9 does allow for alternative solutions. Consider assigning
a constant to all values, rank orders (1-9) or squared rank orders, or even reversed rank
orders (just to be extreme)! Just as we can compare the three ways of scaling people from
tournament outcomes, so can we compare using the Choice model, Thurstone Case V, or
logistic models for fitting alternative scalings of the vegetable data.

First form a data frame made up of the six alternative scaling models (a constant, simple
rank orders of the vegetable choices, squared rank orders, reversed rank orders, frequency of
choice, and Thurstonian scale values) and then apply the scaling.fits function from above
to all six scales (Table 2.14). What is most interesting is that although the scale values differ greatly,
the Choice model fits almost as well for four of the six scaling solutions. The Thurstone (normal) and
logistic fits, in contrast, differ a great deal across the six methods of forming the scales.

Table 2.14 The Thurstone model is not the only model for vegetable preferences. A simple choice
model does almost as well.

> data(vegetables)
> scaled <- thurstone(veg)$scale
> veg.t <- mean(veg) - mean(veg[,1])
> tests <- c("choice","logit","normal")
> veg.scales.df <- data.frame(constant= rep(.5,9),equal= seq(1,9),squared = seq(1,9)^2,
+ reversed = seq(9,1),raw = veg.t,thurstone=scaled)
> round(veg.scales.df,2)
> fits <- matrix(rep(0, 3*dim(veg.scales.df)[2]), ncol = 3)
> for (i in 1:dim(veg.scales.df)[2]) {
+ for (j in 1:3) {
+ fits[i, j] <- scaling.fits(veg.scales.df[i],rowwise=TRUE, data = as.matrix(veg),
+ test = tests[j])$GF } }
> rownames(fits) <- c("Constant", "Equal","Squared", "Reversed","Choice", "Thurstone")
> colnames(fits) <- c("choice", "logistic", "normal")
> round(fits, 2)

constant equal squared reversed raw thurstone


Turn 0.5 1 1 9 0.00 0.00
Cab 0.5 2 4 8 0.15 0.52
Beet 0.5 3 9 7 0.20 0.65
Asp 0.5 4 16 6 0.31 0.98
Car 0.5 5 25 5 0.36 1.12
Spin 0.5 6 36 4 0.37 1.14
S.Beans 0.5 7 49 3 0.46 1.40
Peas 0.5 8 64 2 0.48 1.44
Corn 0.5 9 81 1 0.54 1.63

choice logistic normal


Constant 0.81 0.81 0.81
Equal 0.99 0.88 0.81
Squared 0.98 0.74 0.74
Reversed 0.40 -0.27 -0.43
Choice 0.97 0.89 0.93
Thurstone 0.97 0.97 0.99

What should we conclude from this comparison? Not that the Thurstone techniques are
useless, but rather that the scaling solutions need to be considered in comparison with al-
ternative hypothetical solutions. That is, just because one procedure yields a very good fit
and makes psychometric sense does not imply that it is necessarily better than a simpler
procedure.

2.6.4 Why emphasize Thurstonian scaling?

Perhaps it seems as if too much emphasis has been given to the ordering of vegetables.
However, the process of model fitting and model testing outlined in the previous section is
similar to the process that needs to be followed in all analyses.
1. Examine the data
2. Specify a model
3. Estimate the model
4. Compare the model to the data
5. Repeat until satisfied or exhausted

2.7 Multiple Dimensional Scaling: Distances between Objects (|oi − oj| < |ok − ol|)

The tournament rankings of chess players and the frequency of choice of vegetables (or the
severity of crimes) are found by comparing two stimuli, either subjects (2.4) or objects (2.6.2),
to each other. The social networks of friendship groups can be thought of as representing
distances less than a certain value (|oi − oj| < δ) (2.5). But comparing these distances to
other distances leads to ordering relationships of distances (|oi − oj| < |ok − ol|). The typical
application is to order pairs of distances in a multidimensional space. One classic example is
multidimensional scaling of distances based upon the Euclidean distance between two points,
x and y, which, in an n-dimensional space, is

Distance_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.   (2.12)

Alternative scaling models attempt to fit a monotone function of distance rather than the
Euclidean distance itself. There are a variety of metric and non-metric algorithms; the basic
procedure of the non-metric procedures is to fit a monotonically increasing function of the
distances (e.g., their ranks) rather than the distances themselves (Kruskal, 1964).
Consider the airline distances between 11 American cities in Table 2.15 (found in the
cities data set). Even considering issues of the sphericity of the globe, it is not surprising
that these can be arranged in a two dimensional space. Using the cmdscale function, and
specifying a two dimensional solution, finds the best fitting solution (values for all the cities
on two dimensions, Table 2.16).
Representing these cities graphically produces a rather strange figure (Figure 2.5). Reversing
both axes produces a figure that is more recognizable (Figure 2.6). Using the map function from
the maps package shows that the solution is not quite correct, probably due to the spherical
nature of the real locations.
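The axis reversal and re-plotting can be done in a few lines. The following is only a minimal sketch (the exact plotting code used for Figure 2.6, including the map overlay, is not shown in the text, so the rescaling needed to superimpose the solution on an actual map is omitted here).

# A sketch of the axis reversal (assuming city.location from Table 2.16):
city.location <- -city.location   # reverse both dimensions
plot(city.location, type = "n", xlab = "Dimension 1", ylab = "Dimension 2",
     main = "MultiDimensional Scaling of US cities")
text(city.location, labels = names(cities))
# Overlaying map("usa", add = TRUE) from the maps package would additionally
# require rescaling the MDS coordinates to longitude and latitude; that step
# is not shown here.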
Extensions of the metric multidimensional scaling procedures that fit data where distances are
ordinal rather than interval (e.g., Borg and Groenen (2005); Carroll and Arabie (1980); Green
et al. (1989); Kruskal and Wish (1978)) are known as non-metric multidimensional scaling
and are available in the MASS package as isoMDS and sammon. In addition, some programs are
able to find the best fit for arbitrary values of r for distance in a Minkowski R space (Arabie,
1991; Kruskal, 1964). An r value of 1 produces a city block or Manhattan metric (there are

Table 2.15 Airline distances between 11 American cities taken from the cities data set.
> data(cities)
> cities

ATL BOS ORD DCA DEN LAX MIA JFK SEA SFO MSY
ATL 0 934 585 542 1209 1942 605 751 2181 2139 424
BOS 934 0 853 392 1769 2601 1252 183 2492 2700 1356
ORD 585 853 0 598 918 1748 1187 720 1736 1857 830
DCA 542 392 598 0 1493 2305 922 209 2328 2442 964
DEN 1209 1769 918 1493 0 836 1723 1636 1023 951 1079
LAX 1942 2601 1748 2305 836 0 2345 2461 957 341 1679
MIA 605 1252 1187 922 1723 2345 0 1092 2733 2594 669
JFK 751 183 720 209 1636 2461 1092 0 2412 2577 1173
SEA 2181 2492 1736 2328 1023 957 2733 2412 0 681 2101
SFO 2139 2700 1857 2442 951 341 2594 2577 681 0 1925
MSY 424 1356 830 964 1079 1679 669 1173 2101 1925 0

Table 2.16 Two dimensional representation for 11 American cities.


> city.location <- cmdscale(cities, k=2) #ask for a 2 dimensional solution
> plot(city.location,type="n", xlab="Dimension 1", ylab="Dimension 2",main ="cmdscale(cities)")
> text(city.location,labels=names(cities)) #put the cities into the map
> round(city.location,0) #show the results

[,1] [,2]
ATL -571 248
BOS -1061 -548
ORD -264 -251
DCA -861 -211
DEN 616 10
LAX 1370 376
MIA -959 708
JFK -970 -389
SEA 1438 -607
SFO 1563 88
MSY -301 577

no diagonals), r of 2 is the standard Euclidean, and r values greater than 2 emphasize the
larger distances much more than the smaller distances. The unit “circles” for Minkowski values of
1, 2, and 4 may be seen in the example for the minkowski function.

Distance_{xyr} = \sqrt[r]{\sum_{i=1}^{n} (x_i - y_i)^r}.   (2.13)
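As a small illustration (not from the text), the Minkowski family is available directly in base R's dist function, and the non-metric procedures mentioned above can be run on the airline distances with MASS; the object names below are only chosen for this example.

# Minkowski distances in base R (a sketch): r = 1 (city block), 2 (Euclidean), 4
xy <- rbind(c(0, 0), c(3, 4))
dist(xy, method = "manhattan")             # r = 1: 7
dist(xy, method = "euclidean")             # r = 2: 5
dist(xy, method = "minkowski", p = 4)      # r = 4: about 4.29

# Non-metric alternatives to cmdscale for the distances of Table 2.15
library(MASS)
city.dist <- as.dist(as.matrix(cities))
iso.solution <- isoMDS(city.dist, k = 2)   # Kruskal's non-metric MDS
sammon.solution <- sammon(city.dist, k = 2)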

A further example of the use of Multidimensional Scaling is to represent the patterning


of ability variables by MDS rather than component or factor analysis (Chapter 6.8.1). In
that example, MDS, by examining the relative versus absolute distances, effectively removes
the general factor of ability which is represented by all the correlations being positive (Fig-
ure 6.11).

[Figure 2.5 here: “Multidimensional Scaling of 11 cities” — the cmdscale solution plotted as Dimension 2 against Dimension 1.]

Fig. 2.5 Original solution for 11 US cities. What is wrong with this figure? Axes of solutions are not
necessarily directly interpretable. Compare to Figure 2.6.

2.8 Preferential Choice: Unfolding Theory (|si − o j| < |sk − ol |)

“Do I like asparagus more than you like broccoli?” compares how far apart my ideal vegetable
is to a particular vegetable (asparagus) with respect to how far your ideal vegetable is to
another vegetable (broccoli). More typical is the question of whether you like asparagus
more than you like broccoli. This comparison is between your ideal point (on an attribute
dimension) to two objects on that dimension. Although the comparisons are ordinal, there
is a surprising amount of metric information in the analysis.

2.8.1 Individual Preferences – the I scale

When an individual is asked whether they prefer one object to another, the assumption is
that the preferred object is closer (in an abstract, psychological space) to the person than
is the non-preferred item. The person’s location is known as his or her “ideal point” and the

[Figure 2.6 here: “MultiDimensional Scaling of US cities” — the reversed-axes solution with the city labels overlaid on a map of the US.]

Fig. 2.6 Revised solution for 11 US cities after making city.location <- -city.location and adding
a US map. The correct locations of the cities are shown with circles; the MDS solution is the center
of each label. The central cities (Chicago, Atlanta, and New Orleans) are located very precisely, but
Boston, New York, and Washington, DC are north and west of their correct locations.

closer an object is to the ideal point, the more it is preferred. Consider the case of how many
children someone would like to have. For the purpose of analysis we limit this to 0, 1, 2, 3,
4, or 5 children.
Suppose we ask each individual in a sample of people how many children they would like
to have. It is likely that the first choices would range from 0 to 5. Then, except for those whose
first choice was either 0 or 5, participants are asked “if you could not have that number, would
you rather have one more or one less?”. For people whose second choice was neither 0 nor 5,
this procedure is then continued with question #3: “If you could not have X (the first choice)
or Y (the second choice), would you rather have (one less than the minimum of X and Y) or
(one more than the maximum of X and Y)” where the questioner replaces X and Y with the
values from the subject.
There are many possible individual preference orderings (I-scales) that will be single
peaked: for the person who prefers no children, the preference ordering is 012345, while
for the person who would like a very large family the I-scale would be 543210. For someone
whose ideal point is between 2 and 3, the orderings 231450, 234150, or 231045 are all possible.
Assign the scale value for no children to be 0 and for 5 children to be 100. Where on this
scale are the values for 1, 2, 3 or 4 children? The naive answer is to assume equal spacing
and give values of 20, 40, 60, and 80. But psychologically, is the difference between 0 and
1 the same as between 4 and 5? This can be determined for individual subjects by careful
examination of their preferential orders.

Table 2.17 Midpoint ordering gives some metric information. Left hand side: If the midpoint (2|3)
comes after (to the right of) the midpoint (0|5) that implies that 3 is closer to 5 than 0 is to 2. Right
hand side: The midpoint (2|3) comes before (0|5) and thus 2 is closer to 0 than 3 is to 5. Similarly,
that 2|5 comes before 3|4 implies that 4 is closer to 5 than 2 is to 3.

0 1 2 3 4 5 0 1 2 3 4 5
0 0|5 5 0 0|5 5
0 2|3 5 0 2|3 5
0 1|2 5 0 1|2 5
0 0|1 4|5 5 0 0|1 4|5 5
0 3|4 5 0 3|4 5
0 2|5 5 0 2|5 5

Consider the ordering 231450 versus the 321045 ordering. For the first person, because 5
is preferred to 0 we can say that the (0|5) midpoint has been crossed (the person is to the
right of that midpoint). But the person prefers 2 to 3, and thus the (2|3) midpoint has not
been crossed. This implies that the distance from 0 to 2 is greater than the distance from 3
to 5. For the second person, 321045, the (2|3) midpoint has been crossed but the (0|5) midpoint
has not, implying that the distance from 0 to 2 is less than the distance from 3 to 5
(see Table 2.17).
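The midpoints that a given I-scale implies have been crossed can be listed mechanically; the helper below is hypothetical (written for this example, not part of the psych package) and simply notes that the midpoint (i|j), with i < j, has been crossed whenever j is preferred to i.

# A sketch: which midpoints (i|j) has a respondent crossed, given an I-scale?
crossed.midpoints <- function(ord) {
  ranks <- match(0:5, ord)                      # rank of each object, 1 = most preferred
  pairs <- t(combn(0:5, 2))                     # all pairs i < j
  crossed <- ranks[pairs[, 2] + 1] < ranks[pairs[, 1] + 1]   # j preferred to i
  paste(pairs[crossed, 1], pairs[crossed, 2], sep = "|")
}
crossed.midpoints(c(2, 3, 1, 4, 5, 0))
# for the I-scale 231450 this includes "0|5" but not "2|3", as described above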

2.8.2 Joint Preferences – the J scale

When multiple preference orderings are examined, they can be partially ordered with respect
to their implied midpoints (Figure 2.7). The highlighted I-scales reflect the hypothesis that,
for all subjects, the distance between successive numbers of children decreases (a decelerating
Joint scale).

2.8.3 Partially ordered metrics

The ordering of the midpoints for the highlighted I-scales seen in Figure 2.7 allows distances
on the Joint scale to be partially ordered. The first and last two midpoints provide no
information, for that order is fixed. But the I-scale 123450, by showing that (0|3), (0|4) and (0|5)
come before (1|2), (1|3), and (1|4), gives a great deal of metric information. In contrast, going
down the other side of the partial order, the I-scale 321045 shows that (1|3) and (2|3) come
before (0|4) and implies a different set of partial orders (Table 2.18).

[Figure 2.7 here: the partial order of possible I-scales, running from 012345 at the top to 543210 at the bottom, with each path labeled by the midpoint crossed in moving from one I-scale to the next.]

Fig. 2.7 Possible I-scales arranged to show the ordering of mid-points. The highlighted I-scales reflect a
decelerating Joint scale. The labels for each path show the midpoint “crossed” when going from the
first I-scale to the second I-scale.

2.8.4 Multidimensional Unfolding

Generalizations of Coombs’ original unfolding to the multidimensional case, with both metric
and non-metric applications, are discussed by de Leeuw (2005). The basic technique is to treat
the problem as a multidimensional scaling problem applied to an off-diagonal matrix (that is to say,
objects by subjects).

Table 2.18 By observing particular I-scales, it is possible to infer the ordering of midpoints, which
in turn allows for inferences about the distances between the objects. The first 5 midpoint orders are
implied by the highlighted I scales in Figure 2.7 while the last three are implied by the I scale (321045).

Midpoint order Distance information


(0|3) < (1|2) 23 < 01
(0|4) < (1|2) 24 < 01
(0|5) < (1|2) 25 < 01
(0|5) < (2|3) 35 < 02
(2|5) < (3|4) 45 < 23
(2|3) < (0|4) 34 < 02
(2|3) < (1|4) 34 < 12
(1|3) < (0|4) 34 < 01

2.9 Measurement of Attitudes and Abilities (comparing si , o j )

The measurement of abilities and attitudes compares single items (objects) to single subjects.
The comparison may be either one of order (for abilities) or one of proximity (for attitudes).
The difference in these two models may be seen in Figures 2.9 and 2.10. Although most
personality inventories are constructed using the abilities model, it has been pointed out
that the ideal point (proximity) model is probably more appropriate (Chernyshenko et al.,
2007).
In the discussion of classic (Chapter 7) and modern test theory (Chapter 8), the implications
of these two models will be discussed in detail. Here they are discussed just in
terms of Coombs’ models.

2.9.1 Measurement of abilities (si > o j )

The basic model is that, for ability = θ and difficulty = δ,

prob(correct|θ , δ ) = f (θ − δ ) (2.14)

That is, as the ability attribute increases, the probability of getting an item correct also
increases, and as the difficulty of an item increases, the probability of passing that item
decreases. This is either the explicit (Chapter 8) or implicit (Chapter ??) model of most
modern test theory and will be discussed in much more detail in those subsequent chapters.

2.9.1.1 Guttman scales

Guttman (1950) considered the case where there is no error in the assessment of the item difficulty,
or where the items are sufficiently far apart, so that the pattern of item responses is completely
redundant with the total score. That is,

prob(correct|\theta, \delta) = 1 \quad \text{if } \theta > \delta   (2.15)

prob(correct|\theta, \delta) = 0 \quad \text{if } \theta < \delta.   (2.16)
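Written as code, equations 2.15 and 2.16 reduce to a one-line step function; this small sketch (not from the text) makes the determinism of the model explicit.

# Equations 2.15-2.16 as a step function (a sketch): the probability of passing
# an item is 1 when ability exceeds difficulty and 0 otherwise.
p.guttman <- function(theta, delta) as.numeric(theta > delta)
p.guttman(theta = 1.5, delta = c(-1, 0, 1, 2))   # 1 1 1 0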

One example of items which can be formed into a Guttman scale are those from the social
distance inventory developed by Bogardus (1925) to assess social distance which “refers to
the degrees and grades of understanding and feeling that persons experience regarding each
other. It explains the nature of a great deal of their interaction. It charts the character of
social relations.” (Bogardus, 1925, p 299).

Table 2.19 The Bogardus Social Distance Scale is one example of items that can be made to a
Guttman scale
The Bogardus social distance scale gave the following stem with a list of various ethnicities.
“According to my first feeling reactions I would willingly admit members of each race (as a class, and
not the best I have known, nor the worst member) to one or more of the classifications under which I
have placed a cross (x).”

1. Would exclude from my country


2. As visitors only to my country
3. Citizenship in my country
4. To employment in my occupation in my country
5. To my street as neighbors
6. To my club as personal chums
7. To close kinship by marriage

Such items (with a rewording of items 1 and 2) typically will produce a data matrix similar
to that in Table 2.20. That is, if someone endorses item 5, they will also endorse items 1-4.
The scaling is redundant in that for perfect data the total number of items endorsed always
matches the highest item endorsed. With the exception of a few examples such as social
distance or sexual experience, it is difficult to find examples of sets of more than a few items
that meet the scaling requirements for a Guttman scale.

Table 2.20 Hypothetical response patterns for eight subjects to seven items forming a Guttman scale.
For a perfect Guttman scale the total number of items endorsed (rowSums) reflects the highest item
endorsed.

> guttman <- matrix(rep(0,56),nrow=8)


> for (i in 1:7) { for (j in 1:i) {guttman[i+1,j] <- 1}}
> rownames(guttman) <- paste("S",1:8,sep="")
> colnames(guttman) <- paste("O",1:7,sep="")
> guttman
O1 O2 O3 O4 O5 O6 O7
S1 0 0 0 0 0 0 0
S2 1 0 0 0 0 0 0
S3 1 1 0 0 0 0 0
S4 1 1 1 0 0 0 0
S5 1 1 1 1 0 0 0
S6 1 1 1 1 1 0 0
S7 1 1 1 1 1 1 0
S8 1 1 1 1 1 1 1
> rowSums(guttman)
S1 S2 S3 S4 S5 S6 S7 S8
0 1 2 3 4 5 6 7

2.9.1.2 Normal and logistic trace line models

The Guttman representation of equation 2.14 does not allow for error. A somewhat more
relaxed model that does allow for error is the Mokken scale where each item has a different
degree of difficulty (as in the Guttman scale) but some errors are allowed (Mokken, 1971).
More generally, two models of item responding that allow for error, do not require
different difficulties, and have been studied in great detail are the cumulative normal and the
logistic model. Both of these models consider the probability of being correct on an item
to be an increasing function of the difference between the person’s ability (θ) and the item’s
difficulty (δ). These two equations are the cumulative normal of θ − δ,

prob(correct|\theta, \delta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\theta - \delta} e^{-\frac{u^2}{2}}\, du   (2.17)

and the logistic function


prob(correct|\theta, \delta) = \frac{1}{1 + e^{\delta - \theta}}.   (2.18)
With the addition of a multiplicative constant (1.702) to the difference between δ and θ in
the logistic equation, the two functions are almost identical over the range from -3 to 3
(Figure 2.8):

prob(correct|\theta, \delta) = \frac{1}{1 + e^{1.702(\delta - \theta)}}.   (2.19)
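The near identity of the two curves is easy to verify numerically; the following lines are a quick check (not from the text) of the claim.

# How close are the cumulative normal and the rescaled logistic? (a sketch)
x <- seq(-3, 3, 0.01)
max(abs(pnorm(x) - 1 / (1 + exp(-1.702 * x))))   # less than 0.01 over this range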
Latent Trait Theory (and the associated Item Response Theory, IRT) tends to use the
equations 2.18 or 2.19 in estimating ability parameters for subjects given a set of known
(or estimated) item difficulties (Figure 2.9). People are assumed to differ in their ability in
some domain and items are assumed to differ in difficulty or probability of endorsement in
that domain. The basic model of measuring ability is equivalent to a high jump competition.
Given a set of bars on a high jump, what is the highest bar that one can jump?
Classical test theory (Chapter 7) may be thought of as a high jump with random height
bars and many attempts at jumping. The total number of bars passed is the person’s score.
Item Response Theory approaches (Chapter 8) recognize that bars differ in height and allow
jumpers to skip lower bars if they are able to pass higher bars. For example, in the math-
ematical ability domain, item difficulties may be ordered from easy to hard, (knowing your
arithmetic facts, knowing long division, basic algebra, differential calculus, integral calculus,
matrix algebra, etc.).


Fig. 2.8 The cumulative normal and the logistic (with a constant of 1.702) are almost identical
functions. The code combines a curve for the normal probability with a curve for the logistic function:
curve(pnorm(x),-3,3,ylab="cumulative normal or logistic")
curve(1/(1+exp(-1.702*x)),-3,3,add=TRUE,lty=4)

2.9.2 Measurement of attitudes (|si − o j | < δ )

The alternative comparison model is one of proximity. This leads to a single peaked function
(Figure 2.10). Some items are more likely to be endorsed the lower the subject’s attribute
value, some are most likely to be endorsed at moderate levels, and others have an increasing
probability of endorsement. Thus, if assessing neatness, the item “I am a messy person” will
tend to peak at the lowest levels, while the item “My room neatness is about average” will
peak at the mid ranges, and the item “I am a very neat person” will peak at the highest
levels of the attribute. An item at the highest end of the dimension can be modeled using the
ability model, and an item at the lowest level of difficulty can be modeled by reverse scoring
it (treating rejections as endorsements, and treating endorsements as rejections). However,
items in the middle range do not fit the ordering model and need to be modeled with a single
peaked function (Chernyshenko et al., 2007).
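One convenient way to draw a single peaked trace line like that in Figure 2.10 is to let the probability of endorsement fall off with the squared distance between the person's attribute value (θ) and the item's location (δ); the Gaussian-shaped curve below is only an illustrative sketch, not the specific ideal point model of Chernyshenko et al. (2007).

# A sketch of a single peaked (ideal point) trace line: endorsement is most
# likely when the latent attribute theta is near the item location delta.
delta <- 0                             # item located at the middle of the attribute
curve(0.4 * exp(-(x - delta)^2 / 2), -3, 3,
      xlab = "latent attribute", ylab = "Probability of endorsement")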


Fig. 2.9 The basic ability model. Probability of endorsing an item, or being correct on an item varies
by item difficulty and latent ability and is a monotonically increasing function of ability.

Fig. 2.10 Basic attitude model. Probability of endorsing an item is a single peaked function of the
latent attribute and item location.

2.10 Theory of Data: some final comments

The kind of data we collect reflects the questions we are trying to address. Whether we
compare objects to objects, subjects to subjects, or subjects to objects depends upon the
primary question. Considering the differences between order and proximity information is
also helpful, for some questions are more appropriately thought of as proximities (unfoldings)
rather than orderings. Finally, that simple questions asked of the subject can yield stable
metric and partially ordered metric information is an important demonstration of the power
of modeling.
In the next chapter we consider what happens if we incorrectly assume metric properties
of our data where in fact the mapping function from the latent to observed is not linear.
Chapter 3
The problem of scale

Exploratory data analysis is detective work–numerical detective work–or counting detective
work–or graphical detective work. A detective investigating a crime needs both tools and understanding.
If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he
does not understand where the criminal is likely to have put his fingers, he will not look in
the right places. Equally, the analyst of data needs both tools and understanding (Tukey, 1977, p 1).

As discussed in Chapter 1, the challenge of psychometrics is to assign numbers to observations in
a way that best summarizes the underlying constructs. The ways to collect observations are
multiple and can be based upon comparisons of order or of proximity (Chapter 2). But given
a set of observations, how best to describe them? This is a problem not just for observational
but also for experimental psychologists, for both approaches are attempting to make inferences
about latent variables in terms of statistics based upon observed variables (Figure 3.1).
For the experimentalist, the problem becomes interpreting the effect of an experimental
manipulation upon some outcome variable (path B in Figure 3.1) in terms of the effect of the
manipulation on the latent outcome variable (path b) and the relationship between the latent
and observed outcome variables (path s). For the observationalist, the observed correlation
between the observed Person Variable and Outcome Variable (path A) is interpreted as a
function of the relationship between the latent person trait variable and the observed trait
variable (path r), the latent outcome variable and the observed outcome variable (path s),
and most importantly for inference, the relationship between the two latent variables (path
a).
Paths r and s are influenced by the reliability (Chapter 7), the validity (Chapter 9) and
the shape of the functions r and s mapping the latents to the observed variables. The problem
of measurement is a question about the shape of these relationships. But before it is possible
to discuss shape it is necessary to consider the kinds of relationships that are possible. This
requires a consideration of how to assign numbers to the data.
Consider the set of observations organized into a data.frame, s.df, in Table 3.1. Copy this
table into the clipboard, and read the clipboard into the data.frame, s.df.1 A data.frame is
an essential element in R and has many (but not all) the properties of a matrix. Unlike a
matrix, the column entries can be of different data types (strings, logical, integer, or numeric).
Data.frames have dimensions (the number of rows and columns), and a structure. To see the
structure of a data.frame (or any other R object, use the str function.

1 Because θ is read as X., we add the command colnames(s.df)[4] <- "theta" to match the table.


[Figure 3.1 here: path diagram relating the Latent Person Trait, Latent Person State, Experimental Manipulation, and Latent Outcome variables (paths a-h and r, s, t) to the Observed Person, Observed State, and Observed Outcome variables (paths A-H).]

Fig. 3.1 Both experimental and observational research attempt to make inferences about unobserved
latent variables (traits, states, and outcomes) in terms of the pattern of correlations between observed
and manipulated variables. The uppercase letters (A-H) represent observed relationships, the lowercase
letters (a-h) represent the unobserved but inferred relationships. The shape of the mappings from latent
to observed (r, s, t) affects the kinds of inferences that can be made (adapted from Revelle, 2007).

The read.clipboard function is part of the psych package and makes the default assump-
tion that the first row of the data table has labels for the columns. See ?read.clipboard for
more details on the function.

> s.df <- read.clipboard()


> dim(s.df)
[1] 7 7
> str(s.df)
'data.frame': 7 obs. of 7 variables:

Table 3.1 Six observations on seven participants


Participant Name Gender θ X Y Z

1 Bob Male 1 12 2 1
2 Debby Female 3 14 6 4
3 Alice Female 7 18 14 64
4 Gina Female 6 17 12 32
5 Eric Male 4 15 8 8
6 Fred Male 5 16 10 16
7 Chuck Male 2 13 4 2

$ Participant: int 1 2 3 4 5 6 7
$ Name : Factor w/ 7 levels "Alice","Bob",..: 2 4 1 7 5 6 3
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 2 2
$ theta : int 1 3 7 6 4 5 2
$ X : int 12 14 18 17 15 16 13
$ Y : num 2 6 14 12 8 10 4
$ Z : int 1 4 64 32 8 16 2

3.1 Four broad classes of scales

The association of numbers with data would seem to be easy but in fact is one of the
most intractable problems in psychology. Associating a number with a data
point is straightforward enough (“Alice answered 18 questions correctly, Bob answered
12, Eric 15”), but the inferences associated with these numbers differ depending upon what the
numbers represent. In the mid-20th century, the assignment of numbers and the use of the
term measurement applied to psychological phenomena led to an acrimonious debate between
physicists and psychologists (Ferguson et al., 1940) that was left unresolved. To the physicists,
measurement is “the assignment of numerals to things so as to represent facts of conventions
about them” (p 340). Although this meaning is clearly what we think of when measuring
mass or distance, it implicitly requires the ability to form ratios (something is twice as heavy
as something else, something is three times further away). But the assignment of numbers
to observations in psychology usually does not meet this requirement. In response to the
Ferguson et al. (1940) report, Stevens (1946) proposed what has become the conventional
way of treating numbers in psychology. That is, numbers can be seen as representing nominal,
ordinal, interval or ratio levels of measurement (Table 3.2). Stevens was responding to the
criticism that psychological scales were meaningless because they were not true measurement.
This controversy over what is a measurement continues to this day, with some referring
to the “pathological nature” of psychometrics (Michell, 2000) for ignoring the fundamental
work in measurement theory (Falmagne, 1992; Krantz and Suppes, 1971) associated with
conjoint measurement as advanced by Krantz and Tversky (1971) and others. Falmagne’s
(1992) review is a very nice introduction to the power of measurement theory. Other useful
reviews include the history of measurement by Díez (1997), which discusses the important work
of Hölder (1901) translated by Michell and Ernst (1997).

Although it is foolhardy to summarize volumes of work in a paragraph, a core idea in measurement
theory is that a variable may be said to be measured on an interval scale, u, if
particular patterns of comparisons of order are maintained. Consider the dimension X and
the operation ≥ on pairs of elements of X. Then (a, b) ≥ (c, d) ⇐⇒ u(a) − u(b) ≥ u(c) − u(d),
which implies (a, b) ≥ (c, d) ⇐⇒ (a, c) ≥ (b, d) (Falmagne, 1992). Extending this idea to the
way that variables may be combined and still be said to be measured on the same metric
is the basis of conjoint measurement theory (Krantz and Tversky, 1971). A fundamental
conclusion is that u is an interval scale of a, b, p, and q, if (a, p) ≥ (b, p) ⇐⇒ (a, q) ≥ (b, q)
for all p and q. (As is discussed in Chapter 8, it is a violation of this relationship that leads
proponents of the 1PL Rasch model to reject models with more parameters such as the 2PL
or 1PN.)
In psychometrics, some attention has been paid to measurement theory, and indeed the
advantages of item response theory (Chapter 8) compared to classical test theory (Chapter 7)
have been framed in terms of the measurement properties of scales developed with the two
models (but see Cliff (1992) for a concern that not enough attention has been paid). Even
within the IRT approach, the differences between one parameter Rasch models (8.1.1) and
more complicated models are debated in terms of basic measurement properties (Cliff, 1992).
Proponents of measurement theory seem to suggest that unless psychologists use interval
or ratio measures, they are not doing “real” science. But this seems to ignore examples of
how careful observation, combined with theoretical sophistication but with measures no more
complicated than counts and ordinal relationships, has led to theories as diverse as evolution
and plate tectonics.

Table 3.2 Four types of scales and their associated statistics (Rossi, 2007; Stevens, 1946) The statistics
listed for a scale are invariant for that type of transformation. The Beaufort wind speed scale is interval
with respect to the velocity of the wind, but only ordinal with respect to the effect of the wind. The
Richter scale of earthquake intensity is a logarithmic scale of the energy released but linear measure
of the deflection on a seismometer. *Note that Stevens lists rank correlations as requiring interval
properties although they are insensitive to monotonic transformations.

Scale     Basic operations           Transformations              Invariant statistic                Examples

Nominal   equality                   Permutations                 Counts                             Detection
          xi = xj                                                 Mode                               Species classification
                                                                  χ2 and (φ) correlation             Taxons

Ordinal   order                      Monotonic (homeomorphic)     Median                             Mohs hardness scale
          xi > xj                    x' = f(x), f monotonic       Percentiles                        Beaufort Wind (intensity)
                                                                  Spearman correlations*             Richter earthquake scale

Interval  differences                Linear (Affine)              Mean (µ)                           Temperature (°F, °C)
          (xi − xj) > (xk − xl)      x' = a + bx                  Standard Deviation (σ)             Beaufort Wind (velocity)
                                                                  Pearson correlation (r)
                                                                  Regression (β)

Ratio     ratios                     Multiplication (Similarity)  Coefficient of variation (σ/µ)     Length, mass, time
          xi/xj > xk/xl              x' = bx                                                         Temperature (°K)
                                                                                                     Heating degree days

3.1.1 Factor levels as Nominal values

Assigning numbers to the “names” column is completely arbitrary, for the names are mere
conveniences to distinguish but not to order the individuals. Numbers could be assigned in
terms of the participant order, or alphabetically, or in some random manner. Such nominal
data uses the number system merely as a way to assign separate identifying labels to each
case. Similarly, the “gender” variable may be assigned numeric values, but these are useful
just to distinguish the two categories. In R, variables with nominal values are considered to be
factors with multiple levels. Level values are assigned to the nominal variables alphabetically
(i.e., “Alice”, although the 3rd participant, is given a level value of “1” for the “names” variable;
similarly, “Females” are assigned a value of “1” on the “Gender” factor).
The “names” and “gender” columns of the data represent “nominal” data (also known
as categorical data or, in R, levels of a factor). Columns theta, X, and Z are integer
data, and because of the decimal point appearing in column Y, variable Y is assigned as a
“numeric” variable.

3.1.2 Integers and Reals: Ordinal or Metric values?

If the assignment of numbers to nominal data is arbitrary, what is the meaning of the numbers
for the other columns? What are the types of operations that can be done on these numbers
that allow inferences to be drawn from them? To use more meaningful numbers, can we treat
the Mohs scale of hardness values the same way we treat the indentation hardness values
(refer back to Table 2.6) or the Beaufort scale ratings with wind velocity (Table 2.7)? To
answer these questions, it is useful to first consider how to summarize a set of numbers in
terms of dispersion and central tendency.

3.2 Graphical and numeric summaries of the data

The question is how to best summarize the data without showing all the cases. John Tukey
invented many ways to explore one’s data, both graphically and numerically (Tukey, 1977).
One descriptive technique was the five number summary, which considered the minimum, the
maximum, the median, and the 25th and 75th percentiles. (These latter two are, of course,
just the median number between the minimum and the median, and between the maximum
and the median). The summary function gives these five numbers plus the arithmetic mean.
For categorical (of type factor) variables, summary provides counts. Notice how it orders
the levels of the factor variables alphabetically.
A graphic representation of the Tukey 5 points is the “BoxPlot” drawn by the boxplot
function (Figure 3.2), which includes two more numbers, the upper and lower “whiskers”. These
are defined as the most extreme values that do not exceed 1.5 times the InterQuartileRange
(IQR) beyond the upper and lower quartiles. Why, you might ask, 1.5? The IQR is the
distance from the 25th to the 75th percentile. If the data are sampled from a normal distribution,
the IQR corresponds to 1.35 z units, and 1.5 times that is 2.02. That is, the whiskers will be
about 2.7 z-score units above and below the median. For a normal distribution, this corresponds to

Table 3.3 Basic summary statistics from the summary function include Tukey’s “five numbers”.
> s.df <- read.clipboard()
> colnames(s.df)[4] <- "theta"
> summary(s.df)
> boxplot(s.df[,4:7], main="Boxplot of data from Table 3.1")

Participant Name Gender theta X Y Z


Min. :1.0 Alice:1 Female:3 Min. :1.0 Min. :12.0 Min. : 2 Min. : 1.00
1st Qu.:2.5 Bob :1 Male :4 1st Qu.:2.5 1st Qu.:13.5 1st Qu.: 5 1st Qu.: 3.00
Median :4.0 Chuck:1 Median :4.0 Median :15.0 Median : 8 Median : 8.00
Mean :4.0 Debby:1 Mean :4.0 Mean :15.0 Mean : 8 Mean :18.14
3rd Qu.:5.5 Eric :1 3rd Qu.:5.5 3rd Qu.:16.5 3rd Qu.:11 3rd Qu.:24.00
Max. :7.0 Fred :1 Max. :7.0 Max. :18.0 Max. :14 Max. :64.00
Gina :1

roughly the .005 region at each tail, and thus any point beyond the whiskers in either direction
has about a .01 chance of occurring. (If the minimum is within that distance of the lower
quartile, the whisker ends at the minimum; similarly for the upper whisker.) Several things
become immediately apparent in this graph: X is much higher than Y (which has more
variability), and Z has both a greater IQR as well as one very extreme score. Generalizations
of the boxplot are “notched” boxplots, which give confidence intervals of the median (use the
“notch” option in boxplot), and “violin” plots, which give more graphical representations of
the distributions within the distributions (see vioplot in the vioplot package).
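The arithmetic behind the 1.5 × IQR convention is easy to reproduce; the few lines below are a quick check (not from the text) of the numbers just given.

# A quick check of the 1.5 * IQR reasoning for normally distributed data:
iqr.z <- qnorm(.75) - qnorm(.25)            # the IQR in z units, about 1.35
upper.whisker <- qnorm(.75) + 1.5 * iqr.z   # about 2.70 z units above the median
2 * pnorm(-upper.whisker)                   # about .007: roughly the 1 in 100 chance
                                            # of falling beyond either whisker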

3.2.1 Sorting data as a summary technique

For reasonable size data sets, it is sometimes useful to sort the data according to a meaningful
variable to see if anything leaps out from the data. In this case, sorting by “name” does not
produce anything meaningful, but sorting by the fourth variable, θ, shows that variables 4-7
are all in the same rank order, a finding that was less than obvious from the original data in
Table 3.1. The concept that “Alabama need not come first” (Ehrenberg, 1977; Wainer, 1978,
1983; Wainer and Thissen, 1981) is a basic rule in table construction and implies that sorting
the data by meaningful variables rather than mere alphabetical or item order will frequently
produce useful findings. Specifying that the rows of the data.frame are to be ordered by the
rank order returned by the order function sorts the data frame (Table 3.4).

3.3 Numerical estimates of central tendency

Given a set of numbers, what is the single best number to represent the entire set? Unfortunately,
although the question is easy to state, it is impossible to answer, for the best way
depends upon what is wanted. However, it is possible to say that an unfortunately common
answer, the mode, is perhaps the worst way of estimating the central tendency.


Fig. 3.2 The Tukey box and whiskers plot shows the minima, maxima, 25th and 75th percentiles, as
well as the “whiskers” (either the lowest or highest observation or the most extreme value which is no
more than 1.5 times the interquartile range from the box.) Note the outlier on the Z variable.

3.3.1 Mode: the most frequent



The mode or modal value represents the most frequently observed data point. This is perhaps
useful for categorical data, but not as useful with ordinal or interval data, for the mode is
particularly sensitive to the way the data are grouped or to the addition of a single new
data point. Consider 100 numbers pseudo-randomly generated from 1 to 100 from a uniform
distribution using the runif function. (Alternatively, the sample function could have been used
to sample with replacement from a distribution ranging from 1-100.) Viewed as real numbers to
10 decimal places, there are no repeats and thus all are equally likely. If we convert them to
integers by rounding (round(x)), table the results, and then sort that table, we find that
the most frequent rounded observation was 39 or 48, each of which occurred 4 times. (The example
code combines these three commands into one line.) The mode is different when we use the
stem function to produce a stem and leaf diagram (Tukey, 1977), which groups the data by the first
decimal digits. The stem and leaf shows that there were just as many numbers in the 70s
(14) as in the 30s. Breaking the data into 5 chunks instead of 10 leads to the most numbers
being observed between 60 and 80. So, what is the mode?

Table 3.4 Sometimes, sorting the data shows relationships that are not obvious from the unsorted
data. Two different sorts are shown, the first, sorting alphabetically by name is less useful than the
second, sorting by variable 4. Note that the sort can either be based upon column number or by column
name. Compare this organization to that of Table 3.1.

> n.df <- s.df[order(s.df[,2]),] #create a new data frame, ordered by the 2nd variable of s.df
> s.df <- s.df[order(s.df$theta),] #order the data frame by the fourth variable (theta)
> sn.df <- cbind(n.df,s.df) #combine the two
> sn.df #show them

Participant Name Gender theta X Y Z Participant Name Gender theta X Y Z


3 3 Alice Female 7 18 14 64 1 Bob Male 1 12 2 1
1 1 Bob Male 1 12 2 1 7 Chuck Male 2 13 4 2
7 7 Chuck Male 2 13 4 2 2 Debby Female 3 14 6 4
2 2 Debby Female 3 14 6 4 5 Eric Male 4 15 8 8
5 5 Eric Male 4 15 8 8 6 Fred Male 5 16 10 16
6 6 Fred Male 5 16 10 16 4 Gina Female 6 17 12 32
4 4 Gina Female 6 17 12 32 3 Alice Female 7 18 14 64

> set.seed(1) #to allow for the same solution each time
> x <- runif(100,1,100) #create 100 pseudo random numbers from 1 to 100.
> # x <- sample(100,100,replace=TRUE)
# Alternatively, take 100 samples from the integers 1 to 100
> sort(table(round(x)))
> stem(x)
> stem(x,scale=.5)
2 3 8 9 11 12 15 18 19 22 30 32 33 38 40 49 52 53 56 58 60 61
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
63 69 70 71 76 81 82 83 84 86 89 90 94 95 96 99 7 13 34 41 42 44
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
46 50 66 67 72 73 77 79 80 87 88 91 21 25 27 35 65 78 39 48
2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4
> stem(x)

0 | 237789
1 | 1233589
2 | 1112555777
3 | 0234455589999
4 | 01122446688889
5 | 002368
6 | 01355566779
7 | 01223367788899
8 | 001234677889
9 | 0114569

> stem(x,.5)

The decimal point is 1 digit(s) to the right of the |



0 | 2377891233589
2 | 11125557770234455589999
4 | 01122446688889002368
6 | 0135556677901223367788899
8 | 0012346778890114569

The mode is a useful summary statistic for categorical data but should not be used to
summarize characteristics of data that have at least ordinal properties.

3.3.2 Median: the middle observation

A very robust statistic of central tendency is the median or middle number of the ranked
numbers. For an odd numbered set, the median is that number with as many numbers above
it as below it. For an even number of observations, the median is half way between the
two middle values. A robust estimate is one that has the property that slight changes in
the distribution will lead to small changes in the estimate (Wilcox, 2005). The median is
particularly robust in that monotonic changes in the values of all the numbers above it or
below it do not affect the median.
Tukey’s 5 number summaries take advantage of the median and, in addition, define the
lower and upper quartiles as the medians of the observations below and above the median (see summary). The median
subject will not change if the data are transformed with any monotonic transformation, nor
will the median value change if the data are “trimmed” of extreme scores, either by deleting the
extreme scores or by converting all scores beyond a certain value to that value (“winsorizing”
the data, see the winsor function).
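A two-line demonstration (not from the text) makes the point: capping or deleting an extreme score moves the mean considerably but leaves the median untouched.

# A quick demonstration of the robustness of the median to an extreme score:
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
c(median = median(x), mean = mean(x))        # median 5, mean about 15.1
x.w <- pmin(x, 8)                            # "winsorize" by capping values at 8
c(median = median(x.w), mean = mean(x.w))    # median still 5, mean about 4.9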
The median is perhaps the best single description of a set of numbers, for it is that char-
acterization that is exactly above 1/2 and exactly below 1/2 of the distribution. Graphically,
it is displayed as a heavy bar on a box plot (Figure 3.2).
Galton (1874) was a great proponent of the median as an estimate of central tendency for
the simple reason that it was easy to find when taking measurements in the field:
Now suppose that I want to get the average height and “probable error” of a crowd [...]. Measuring
them individually is out of the question; but it is not difficult to range them –roughly for the
most part, but more carefully near the middle and one of the quarter points of the series. Then
I pick out two men, and two only–the one as near the middle as may be, and the other
near the quarter point, and I measure them at leisure. The height of the first man is the average
of the whole series, and the difference between him and the other man gives the probable error
(Galton, 1874, p 343).

In addition to the technique of lining people up by height to quickly find the median
height, Galton (1899) proposed a novel way of using normal theory to estimate both the
median and the interquartile range:
The problem is representative of a large class of much importance to anthropologists in the field,
few of whom appear to be quick at arithmetic or acquainted even with the elements of algebra.
They often desire to ascertain the physical characteristics of [people] who are too timourous
or suspicious to be measured individually, but who could easily be dealt with by my method.
Suppose it to be a question of strength, as measured by lifting power, and that it has been
ascertained that a per cent. of them fail to lift a certain bag A of known weight, and that b per
cent of them fail to lift an even heavier bag B. From these two data, the median strength can be
determined by the simple method spoken of above, and not only it but also the distribution of
strength among the people.

Unfortunately, when the data are grouped in only a few levels (say 4 or 5 response levels on
a teacher rating scale, or by year in school for college students), the median does not give the
resolution needed for useful descriptions of the data. It is more useful to consider that each
number, x, represents the range from x - .5w to x +.5w, where w is the width of the interval
represented by the number. If there are multiple observations with the same nominal value,
they can be thought of as being uniformly distributed across that range. Thus, given the
two sets of numbers, x and y, with values ranging from 1 to 5 (Table 3.5) the simple median
(the 17th number in these 33 item sets) is 3 in both cases, but the first “3” represents the
lower range of 2.5-3.5 and the second “3” represents the highest part of the same range. Using
linear interpolation and the interp.median function, the interpolated medians are 2.54 and
3.46 respectively. By comparing the results of the summary and interp.quartiles functions,
the distinction is even clearer. The summary output fails to capture the difference between
these two sets of data as well as the interpolated quartiles do (see Figure 3.3 for another way
of looking at the data). Note the use of the order function to rearrange the data and the print
function to specify the precision of the answer.

Table 3.5 Finding the median and other quantiles by interpolation gives more precision. Compare
the 1st, 2nd and 3rd Quartiles from the summary function to those found by the interp.quartiles
function. See Figure 3.3 for another perspective.

> x <- c(1,1,2,2,2,3,3,3,3,4,5,1,1,1,2,2,3,3,3,3,4,5,1,1,1,2,2,3,3,3,3,4,2)


> y <- c(1,2,3,3,3,3,4,4,4,5,5,1,2,3,3,3,3,4,4,5,5,5,1,5,3,3,3,3,4,4,4,5,5)
> x <- x[order(x)] #sort the data by ascending order to make it clearer
> y <- y[order(y)]
> data.df <- data.frame(x,y)
> summary(data.df)
> print(interp.quartiles(x),digits=3) #use print with digits to make pretty output
> print(interp.quartiles(y),digits=3)

> x
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 5 5
> y
[1] 1 1 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5
x y
Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:3.000
Median :3.000 Median :3.000
Mean :2.485 Mean :3.485
3rd Qu.:3.000 3rd Qu.:4.000
Max. :5.000 Max. :5.000
> print(interp.quartiles(x),digits=3) #use print with digits to make pretty output
[1] 1.50 2.54 3.23
> print(interp.quartiles(y),digits=3)
[1] 2.77 3.46 4.50


Fig. 3.3 When the data represent just a few response levels (e.g., emotion or personality items, or
years of education), raw medians and quartile statistics fail to capture distinctions in the data. Using
linear interpolation within each response level (interp.quartiles), finer distinctions may be made.
Although both the X and Y data sets have equal medians the data are quite different. See Table 3.5
for the data.

3.3.3 3 forms of the mean

Even though most people think they know what a mean is, there are at least three different
forms seen in psychometrics and statistics. One, the arithmetic average, is what is most
commonly thought of as the mean:
\bar{X} = X_. = (\sum_{i=1}^{N} X_i)/N    (3.1)

Applied to the data set in Table 3.1, the arithmetic means for the last four variables are
(rounded to two decimals):
> round(mean(s.df[,4:7]),2)
theta     X     Y     Z
 4.00 15.00  8.00 18.14

Because the mean is very sensitive to outliers, it is sometimes recommended to “trim” the top
and bottom n%. Trimming the top and bottom 20% of the data in Table 3.1 leads to very
different estimates for one of the variables (Z). Another technique for reducing the effect
of outliers is to find the “Winsorized” mean. This involves sorting the data and then replacing
all values below the nth lowest value with that value, and all values above the nth highest
value with that value (Wilcox, 2005). Several packages have functions to calculate the
Winsorized mean, including winsor in the psych package.
> round(mean(s.df[,4:7],trim=.2),2)    #the trimmed means
theta     X     Y     Z
  4.0  15.0   8.0  12.4
> round(winsor(s.df[,4:7],trim=.2),2)  #the Winsorized means
theta     X     Y     Z
 4.00 15.00  8.00 13.71
Another way to find a mean is the geometric mean, which is the Nth root of the product of
the N values of X_i:

\bar{X}_{geometric} = \sqrt[N]{\prod_{i=1}^{N} X_i}    (3.2)

Sometimes, the short function we are looking for is not available in R, but can be created
rather easily. Creating a new function (geometric.mean) and applying it to the data is such
a case:
> geometric.mean <- function(x, na.rm=TRUE) { exp(mean(log(x), na.rm=na.rm)) }   #pass na.rm on to mean
> round(geometric.mean(s.df[4:7]),2)
theta X Y Z
3.38 14.87 6.76 8.00
The third type of mean, the harmonic mean, is the reciprocal of the arithmetic average
of the reciprocals, and we can create the function harmonic.mean to calculate it:

\bar{X}_{harmonic} = \frac{N}{\sum_{i=1}^{N} 1/X_i}    (3.3)

> harmonic.mean <- function(x, na.rm=TRUE) { 1/(mean(1/x, na.rm=na.rm)) }   #pass na.rm on to mean


> round(harmonic.mean(s.df[4:7]),2)
theta     X     Y     Z
 2.70 14.73  5.40  3.53

The latter two means can be thought of as the anti-transformed arithmetic means of
transformed numbers. That is, just as the harmonic mean is the reciprocal of the average
reciprocal, so is the geometric mean the exponential of the arithmetic average of the logs of X_i:

\bar{X}_{geometric} = e^{(\sum_{i=1}^{N} \log(X_i))/N}.    (3.4)

The harmonic mean is used in the unweighted means analysis of variance when trying to
find an average sample size. Suppose 80 subjects are allocated to four conditions but for some
reason are allocated unequally to produce samples of size 10, 20, 20, and 30. The harmonic
cell size = 4/(1/10 + 1/20 + 1/20 + 1/30) = 4/.2333 = 17.14, rather than the 20 per cell if they were
distributed equally. Harmonic means are also used when averaging resistances in electric circuits or the
amount of insulation in a combination of windows.
The geometric mean is used when averaging slopes and is particularly meaningful when
looking at anything that shows geometric or exponential growth. It is equivalent to finding
the arithmetic mean of the log transformed data expressed in the original (un-logged) units.
For distributions that are log normally distributed, the geometric mean is a better indicator
of the central tendency of the distribution than is the arithmetic mean. Unfortunately, if any
of the values are 0, the geometric mean is 0, and the harmonic mean is undefined.
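As a minimal sketch of the three forms of the mean (using a small hypothetical vector), note how the arithmetic mean is always at least as large as the geometric mean, which in turn is at least as large as the harmonic mean:

> x <- c(1, 2, 4, 8, 16)     #a small, positively skewed set of numbers
> mean(x)                    #arithmetic mean = 6.2
> exp(mean(log(x)))          #geometric mean = 4
> 1/mean(1/x)                #harmonic mean = 2.58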

3.3.4 Comparing variables or groups by their central tendency

Returning to the data in Table 3.1, the six estimates of central tendency give strikingly
different estimates of which variable is “on the average greater” (Table 3.6). X has the greatest
median, geometric and harmonic mean, while Z has the greatest arithmetic mean, but not the
greatest trimmed mean. Z is a particularly troublesome variable, with the greatest arithmetic
mean and the next to smallest harmonic mean.

Table 3.6 Six estimates of central tendency applied to the data of Table 3.1. The four variables differ
in their rank orders of size depending upon the way of estimating the central tendency.

theta X Y Z
Median 4.00 15.00 8.00 8.00
Arithmetic 4.00 15.00 8.00 18.14
Trimmed 4.00 15.00 8.00 12.40
Winsorized 4.00 15.00 8.00 13.71
Geometric 3.38 14.87 6.76 8.00
Harmonic 2.70 14.73 5.40 3.53

3.4 The effect of non-linearity on estimates of central tendency

Inferences from observations are typically based on central tendencies of observations. But
the inferences can be affected by not just the underlying differences causing these observa-
tions, but the way these observations are taken. Consider the example of psychophysiological
measures of arousal. Physiological arousal is thought to reflect levels of excitement, alertness
and energy. It may be indexed through measures of the head, the heart, and the hand. Among
the many ways to measure arousal are two psychophysiological indicators of the degree of
palmar sweating. Skin conductance (SC) taken at the palm or the fingers is a direct measure
of the activity of the sweat glands of the hands. It is measured by passing a small current
through two electrodes, one attached to one finger, another attached to another finger. The
higher the skin conductance, the more aroused a subject is said to be. It is measured in units
of conductance, or mhos. Skin resistance (SR) is also measured by two electrodes, and reflects
the resistance of the skin to passing an electric current. It is measured in units of resistance,
the ohm. The less the resistance, the greater the arousal. These two measures, conductance
and resistance are reciprocal functions of each other.
Consider two experimenters, A and B. They both are interested in the effect of an exciting
movie upon the arousal of their subjects. Experimenter A uses Skin Conductance, experi-
menter B measures Skin Resistance. They first take their measures, and then, after the movie,
take their measures again. The data are shown in Table 3.7. Remember that higher arousal
should be associated with greater skin conductance and lower skin resistance. The means for
the post test indicate a greater conductance and resistance, implying both an increase (as
indexed by skin conductance) and a decrease (as measured by skin resistance)!
How can this be? Graphing the results shows the effect of a non-linear transformation of
the data on the mean (Figure 3.4). The group with the smaller variability (the control group)
has a mean below the straight line connecting the points with the greater variability (the
movie group). The mean conductance and mean resistance for the movie condition is on this
straight line.

Table 3.7 Hypothetical study of arousal using an exciting movie. The post test shows greater arousal
if measured using skin conductance, but less arousal if measured using skin resistance.

Condition Subject Skin Conductance Skin Resistance


Pretest 1 2 .50
2 2 .50
Average 2 .50
Posttest 1 1 1.00
2 4 .25
Average 2.5 .61

3.4.1 Circular Means

An even more drastic transformation of the data that requires yet another way of estimating
central tendency is when the units represent angles and thus can be represented as locations
on a circle. The appropriate central tendency is not the arithmetic mean but rather the
circular mean (Jammalamadaka and Lund, 2006). A typical example in psychology is the
measurement of mood over the day. Energetic arousal (EA), as measured by such self report
items as being alert, wide awake, and not sleepy or tired (Rafaeli and Revelle, 2006; Thayer,
1989), shows a marked diurnal or circadian rhythm. That is, EA is low but rising in the
morning, peaks sometime during the early to late afternoon, and then declines at night.
Another example of a phasic rhythm that shows marked individual differences is body
temperature (Baehr et al., 2000). Consider the hypothetical data in Table 3.8 showing the time of
day that each of four scales achieved its maximum for six subjects. (This could be found by
examining the data within each subject for multiple times of day and then finding the maximum
for the scale. A technique for finding the phase angle associated with the maximum will

Fig. 3.4 The effect of non-linearity and variability on estimates of central tendency. The movie con-
dition increases the variability of the arousal measures. The “real effect” of the movie is to increase
variability which is mistakenly interpreted as an increase/decrease in arousal.

be discussed later (5.4.3). What is the central tendency for each scale? The simple arithmetic
mean suggests that Tense Arousal achieves its maximum at 12 noon and Negative Affect has
an average maximum at 9 am. But examining the data suggests that midnight and 5 am
are better measures of the central tendency. Using mean.circular from the circular package
or circadian.mean from the psych package converts the angles (expressed in radians for
mean.circular or hours for circadian.mean) to two dimensional vectors (representing the
sin and cosine of the angle), finds the averages for each dimension, and then translates the
average vector back into angles. Note how for the sample mood data in Table 3.8, the circular
means correctly capture the change in phase angles between the four moods.
Table 3.8 Hypothetical mood data from six subjects for four mood variables. The values reflect the
time of day that each scale achieves its maximum value for each subject. Each mood variable is just
the previous one shifted by 5 hours. Note how this structure is preserved for the circular mean but not
for the arithmetic mean.
Subject Energetic Arousal Positive Affect Tense Arousal Negative Affect
1 9 14 19 24
2 11 16 21 2
3 13 18 23 4
4 15 20 1 6
5 17 22 3 8
6 19 24 5 10
Arithmetic Mean 14 19 12 9
Circular Mean 14 19 24 5
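A minimal sketch of the underlying computation (using a small hypothetical set of times near midnight): convert the hours to angles, average the sines and cosines, and convert the resulting angle back to hours. The circadian.mean function in psych and mean.circular in circular do this directly.

> hours <- c(23, 1, 2)                        #hypothetical times of peak mood near midnight
> mean(hours)                                 #the arithmetic mean (8.67) is badly misleading
> radians <- hours * 2 * pi/24                #express each time as an angle on a 24 hour circle
> theta <- atan2(mean(sin(radians)), mean(cos(radians)))   #the angle of the mean vector
> (theta %% (2 * pi)) * 24/(2 * pi)           #back in hours: about 0.68, shortly after midnight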

3.5 Whose mean? The problem of point of view

Even if the arithmetic average is used, finding the central tendency is not as easy as just
adding up the observations and dividing by the total number of observations (Equation 3.1).
For it is important to think about what is being averaged. Incorrectly finding an average can
lead to very serious inferential mistakes. Consider two examples: the first is how long people
are in psychotherapy, the second is the average class size in a particular department.

3.5.1 Average length of time in psychotherapy

A psychotherapist is asked what is the average length of time that a patient is in therapy.
This seems to be an easy question, for of the 20 patients, 19 have been in therapy for between
6 and 18 months (with a median of 12) and one has just started. Thus, the median client is
in therapy for 52 weeks, with an average (in weeks) of (1 * 1 + 19 * 52)/20 or 49.4.
However, a more careful analysis examines the case load over a year and discovers that
indeed, 19 patients have a median time in treatment of 52 weeks, but that each week the
therapist is also seeing a new client for just one session. That is, over the year, the therapist
sees 52 patients for 1 week and 19 for a median of 52 weeks. Thus, the median client is in
therapy for 1 week and the average client is in therapy for (52 * 1 + 19 * 52)/(52 + 19) =
14.6 weeks.
A similar problem of taking cross sectional statistics to estimate long term duration has
been shown in measuring the average length of time people are on welfare (a social worker’s
case load at any one time reflects mainly long term clients, but most clients are on welfare for
only a short period of time). Situations where the participants are self weighted lead to this
problem. The average velocity of tortoises and hares passing by an observer will be weighted
towards the velocity of hares, because more of them pass by, even though the average velocity
across all of the tortoises and hares is much less.
3.5.2 Average class size

Consider the problem of a department chairman who wants to recruit faculty by emphasizing
the smallness of class size but also report to a dean how effective the department is at meeting
its teaching requirements. Suppose there are 20 classes taught by a total of five different
faculty members. 12 of the classes are of size 10, 4 of size 20, 2 of 100, one of 200, and one of
400. The median class size from the faculty member point of view is 10, but the mean class
size to report to the dean is 50!
But what seems like a great experience for students, with a median class size of 10, is
actually much larger from the students’ point of view, for 400 of the 1,000 students are in a
class of 400, 200 are in a class of 200, 200 are in classes of 100, and only 80 are in classes of 20,
and 120 are in classes of 10. That is, the median class size from the students’ perspective
is 200, with an average class size of (10*120 + 20*80 + 100*200 + 200*200 + 400*400)/1000
= 222.8.
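A brief sketch of the two points of view (the class sizes are those given above): weighting each class by its enrollment reproduces the students’ perspective.

> class.size <- c(rep(10,12), rep(20,4), 100, 100, 200, 400)   #the 20 classes
> mean(class.size)                        #50, the average class taught (the dean's view)
> median(class.size)                      #10, the median class taught
> student.view <- rep(class.size, times = class.size)   #each class appears once per enrolled student
> mean(student.view)                      #222.8, the average class size a student experiences
> median(student.view)                    #200, the median class size a student experiences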

Table 3.9 Average class size depends upon point of view. For the faculty members, the median of 10
is very appealing. From the Dean’s perspective, the faculty members teach an average of 50 students
per class.

Faculty   Freshman/   Junior   Senior   Graduate    Mean   Median
Member    Sophomore
A             20         10       10       10       12.5     10
B             20         10       10       10       12.5     10
C             20         10       10       10       12.5     10
D             20        100       10       10       35.0     15
E            200        100      400       10      177.5    150
Total
Mean          56         46       88       10       50.0     39
Median        20         10       10       10       12.5     10

Table 3.10 Class size from the students’ point of view. Most students are in large classes; the median
class size is 200 with a mean of 223.
Class size Number of classes number of students
10 12 120
20 4 80
100 2 200
200 1 200
400 1 400

3.6 Non-linearity and interpretation of experimental effects

Many experiments examining the effects of various manipulations or interventions on subjects
differing in some way are attempts at showing that manipulation X interacts with personality
dimension Y such that X has a bigger effect upon people with one value of Y than another
(Revelle, 2007; Revelle and Oehleberg, 2008). Unfortunately, without random assignment
of subjects to conditions, preexisting differences between the subjects in combination with
non-linearity of the observed score-latent score relationship can lead to interactions at the
observed score level that do not reflect interactions at the latent score level.
In a brave attempt to measure the effect of a liberal arts education, Winter and McClel-
land developed a new measure said to assess “the ability to form and articulate complex
concepts and then the use of these concepts in drawing contrasts among examples and in-
stances in the real world” (p 9). Their measure was to have students analyze the differences
between two thematic apperception protocols. Winter and McClelland compared freshman
and senior students at a ”high-quality, high prestige 4 year liberal arts college located in New
England” (referred to as “Ivy College”) with those of “Teachers College”, which was a “4-year
state supported institution, relatively nonselective, and enrolling mostly lower-middle-class
commuter students who are preparing for specific vocations such as teaching”. They also
included students from a “Community College” sample with students similar to those of
“Teachers College”. Taking raw difference scores from freshman year to senior year, they
found much greater improvement for the students at “Ivy College” and concluded that “The
liberal education of Ivy College improved the ability to form and articulate concepts, sharp-
ened the accuracy of concepts, and tended to fuse these two component skills together” (p
15). That is, the students learned much more at the more prestigious (and expensive)
school (Winter and McClelland, 1978). While the conclusions of this study are perhaps dear
to all faculty members at such prestigious institutions, they suffer from a serious problem.
Rather than reproducing the data from Winter and McClelland (1978) consider the left
panel of Figure 3.5. The students at “Ivy College” improved more than did their colleagues
at “Teachers College” or the “Junior College. When shown these data, most faculty members
explain them by pointing out that well paid faculty at prestigious institutions are better
teachers. Most students explain these results as differences in ability (the “rich get richer”
hypothesis) or bright students are more able to learn complex material than are less able
students.
However, when given a hypothetical conceptual replication of the study, but involving
mathematics performance, yielding the results shown in the right hand panel of Figure 3.5,
both students and faculty members immediately point out that there is a ceiling effect on
the math performance. That is, the bright students could not show as much change as the
less able students because their scores were too close to the maximum.
What is interesting for psychometricians, of course, is that both panels are generated
from the exact same monotonic curve, but with items of different difficulties. Consider equation
2.18, which is reproduced here:

prob(correct|\theta, \delta) = \frac{1}{1 + e^{\delta - \theta}}.    (3.5)
Let the ability parameter, θ , take on different values for the three colleges, (JC = -1, TC =
0, IC = 1), let ability increase 1 unit for every year of schooling, and set the difficulty for the
writing at 4 and for the math at 0. Then equation 3.5 is able to produce both the left panel
(a hard task) or the right panel (an easy task). The appearance of an interaction in both
panels is real, but it is at the observed score level, not at the latent level. For the different
slopes of the lines reflect not an interaction of change in ability as a function of college, for

Fig. 3.5 The effect of four years of schooling upon writing and mathematics performance. More
selective colleges produce greater change in writing performance than do teacher colleges or junior
colleges, but have a smaller effect on improvement in math performance.

at the latent, θ , level, one year of schooling had an equal effect upon ability (an increase of
1 point) for students at all three colleges and for either the writing or the math test.
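A minimal sketch of how the two panels of Figure 3.5 follow from Equation 3.5, using the parameter values given above (college baselines of -1, 0, and 1, one unit of latent growth per year, and difficulties of 4 and 0):

> theta <- outer(c(JC = -1, TC = 0, IC = 1), 0:4, "+")   #latent ability by college and year in school
> round(1/(1 + exp(4 - theta)), 2)    #observed scores on the hard (writing) task
> round(1/(1 + exp(0 - theta)), 2)    #observed scores on the easy (math) task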
This example is important to consider, for it reflects an interpretive bias that is all too
easy to have: if the data fit one’s hypothesis (e.g., that smart students learn more), interpret
that result as confirming the hypothesis, but if the results go against the hypothesis (smart
students learn less), interpret the results as an artifact of scaling (in this case, a ceiling effect).
The moral of this example is that when seeing fan-fold interactions such as in Figure 3.5,
do not interpret them as showing an interaction at the latent level unless further evidence
allows one to reject the hypothesis of non-linearity.
Other examples of supposed interactions that could easily be scaling artifacts include
stage models of development (children at a particular stage learn much more than children
below or above that stage), the effect of hippocampal damage on short term versus long
term memory performance, and the interactive effect on vigilance performance of time on
task with the personality dimension of impulsivity. In general, without demonstrating a
linear correspondence between the latent and observed score, main effects (Figure 3.4) and
interactions (Figure 3.5) are open to measurement artifact interpretations (Revelle, 2007).

3.6.1 Linearity, non-linearity and the properties of measurement

It is these problems in the interpretation of mean differences that have been the focus of work on
the fundamentals of measurement (Krantz and Suppes, 1971) and that were the basis of the
controversy in the 1930s (Ferguson et al., 1940). Stevens’ (1946) proposal to consider four
levels of psychological measures suggested that to compare means it was necessary to have
at least interval levels of measurement and that without such measurement qualities, we are
restricted to comparisons of medians. The problem of interpretation considered in Figure 3.5
does not occur if the discussion is in terms of medians, for in that case, the effect of a year
in schooling is a monotonic increase for all three institutions and there is no possibility of
saying that one group changed more than another group.
Comparisons using scales developed using the Rasch model have been claimed by some
(Bond and Fox, 2007; Borsboom, 2005; Borsboom and Scholten, 2008) to offer the interval
quality of measurement required for the comparisons of means using the principles of conjoint
measurement (Krantz and Tversky, 1971) although others strongly disagree (Kyngdon, 2008;
Michell, 2000, 2004) and yet others remain strongly undecided (Reise and Waller, 2009). The
pragmatic advice is to be very careful about interpreting ordinal interactions or any effect
that can go away with a monotonic transformation and to look for disordinal interactions or
effects that remain even after extreme but monotonic transformations.

3.7 Measures of dispersion

In addition to describing a data set with a measure of central tendency, it is important to
have some idea of the amount of dispersion around that central value.

3.7.1 Measures of range

Perhaps the most obvious measure of dispersion is the range from the highest to the lowest value.
Unfortunately, the range partly reflects the size of a sample, for as the sample size increases,
the probability of observing at least one rare (extreme) event increases as well. (The
probability of the extreme event has not changed, but given more observations, the probability
of observing at least one increases.) This is shown in the left panel of Figure 3.6 for samples
of size 2 to $10^6$. The range (the difference between the maximum and minimum values)
increases dramatically with sample size. One important use of the range is to detect data entry
errors. For if the largest possible value should be 9 and an occasional 99 is discovered, it is
likely that a mistake has occurred. Thus, finding the max and min of the data is useful, but
normally just as a way of checking for errors.
A more useful measure of range is the interquartile range, that is, the range from the 25th
percentile to the 75th percentile. As seen in the right panel of Figure 3.6, the interquartile
range barely varies with sample sizes above about 32. Here the range is expressed in raw score
units. The IQR function can be used to find the interquartile range. For normal data, the IQR
should be twice the normal score of the 75th percentile = 2 * qnorm(.75) = 1.348980.
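A rough sketch of this effect (the exact values will vary from run to run, since the samples are random):

> set.seed(42)                              #for a reproducible example
> small <- rnorm(100)
> large <- rnorm(100000)
> diff(range(small)); diff(range(large))    #the range grows noticeably with sample size
> IQR(small); IQR(large)                    #the interquartile range stays close to 1.35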

Fig. 3.6 Left hand panel: The minimum and maximum of a sample will generally get further apart
as the sample size increases. Right hand panel: The distance between the 25th and 75th percentile
(the interquartile range) barely changes as sample size increases. Data are taken from random normal
distributions of sample sizes of 2 to $2^{20}$. Sample size is log transformed.

In that 50% of the observations will be between the lower and upper quartile, Galton
(1888) took 1/2 of the interquartile range as a measure of the probable error. That is, for
any set of numbers with median, M, the interval M - .5 * IQR to M + .5 * IQR will include
half of the numbers.
This unit is known by the uncouth and not easily justified names of ‘probable error,’ which I
suppose is intended to express the fact that the number of deviations or ‘Errors’ in the two outer
fourths of the series is the same as those in the middle two fourths; and therefore the probability
is equal that an unknown error will fall into either of these two great halves, the outer or the
inner. (Galton, 1908, Chapter XX. Heredity)

3.7.2 Average distance from the central tendency

Given some estimate of the “average” observation (where the average could be the median,
the arithmetic mean, the geometric mean, or the harmonic mean), how far away is the average
participant? Once again, there are multiple ways of answering this question.
3.7.2.1 Median absolute deviation from the median

When using medians as estimates of central tendencies, it is common to also consider the
median absolute distance from the median (mad), that is, median(abs(X - median(X))). The
mad function returns the appropriate value. For consistency with normal data, by default
the mad function is adjusted for the fact that it is systematically smaller than the standard
deviation (see below) by a factor of 1/qnorm(.75). Thus, the default is to return the median
absolute deviation * 1.4826. With this adjustment, if the data are normal, then the mad and
sd function will return almost identical values. If, however, the data are not normal, but
contain some particularly unusual data points (outliers), the mad will be much less than the
sd (see the discussion of robust estimators of dispersion at 3.14).
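A minimal illustration of this robustness (with simulated data, so the exact values will vary): a single wild observation inflates the standard deviation but barely moves the mad.

> set.seed(1)
> x <- c(rnorm(99), 50)    #99 well behaved observations plus one wild outlier
> sd(x)                    #badly inflated by the single outlier (roughly 5)
> mad(x)                   #still close to 1, as it would be for uncontaminated normal data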

3.7.2.2 Sums of squares and Euclidean distance

A vector X with n elements can be thought of as a line in n dimensional space. Generalizing
Pythagoras to n dimensions, the length of that line in Euclidean space will be the square
root of the sum of the squared distances along each of the n dimensions (remember that
Pythagoras showed that for two dimensions $c^2 = a^2 + b^2$, or $c = \sqrt{a^2 + b^2}$).
To find the sums of squares of a vector X we multiply the transpose of the vector ($X^T$)
times the vector (X):

SS = Sum Squares = \sum_{i=1}^{n} X_i^2 = X^T X    (3.6)

If X is a matrix, then the Sum Squares will be the diagonal of the $X^T X$ matrix product.
Letting X be the matrix formed from the last 4 variables from Table 3.1:
> X <- as.matrix(s.df[,4:7])    #convert the last four columns to a matrix
> SS <- diag(t(X)%*% X)         #the diagonal of X'X holds the sums of squares
> X
theta X Y Z
1 1 12 2 1
7 2 13 4 2
2 3 14 6 4
5 4 15 8 8
6 5 16 10 16
4 6 17 12 32
3 7 18 14 64
> SS
theta X Y Z
140 1603 560 5461

3.7.3 Deviation scores and the standard deviation

Rather than considering the raw data (X), it is more common to transform the data by
subtracting the mean from all data points.
deviation score_i = x_i = X_i - X_. = X_i - \sum_{i=1}^{n} X_i / n    (3.7)

Finding the Sums of Squares or length of this vector is done by using Equation 3.6, and
for a data matrix, the SS of deviation scores will be $x^T x$. If the SS is scaled by the number
of observations (n) or by the number of observations - 1 (n - 1), it becomes a Mean Square, or
Variance. The variance is the second moment around the mean:

\sigma^2 = \sum_{i=1}^{n} x_i^2 / (n-1) = x^T x / (n-1)    (3.8)

Taking the square root of the Variance converts the result back into the original units and is a
measure of the length of the vector of deviations in n-dimensional space. The term variance
was introduced as the squared standard deviation by William Sealy Gossett, publishing under
the name of “Student” (Pearson, 1923).
> X <- as.matrix(s.df[,4:7])
> c.means <- colMeans(X)                              #the column means
> X.mean <- matrix(rep(c.means,7),byrow=TRUE,nrow=7)  #a matrix holding the mean of each column
> x <- X - X.mean                                     #the deviation scores
> SS <- diag(t(x)%*% x)                               #sums of squares of the deviations
> x.var <- SS/(dim(x)[1]-1)                           #divide by (n - 1) to get the variance
> x.sd <- sqrt(x.var)                                 #the standard deviation
> SS
theta X Y Z
28.000 28.000 112.000 3156.857
> x.var
theta X Y Z
4.666667 4.666667 18.666667 526.142857
> x.sd
theta X Y Z
2.160247 2.160247 4.320494 22.937804
As would be expected, because the operation of finding the sums of squares of deviations
from the mean is so common, rather than doing the matrix operations shown above, functions
for the standard deviation and the variance are basic functions in R. sd returns the standard
deviation of a vector or each column of a data frame, var returns the variance and covariances
of each column of a data frame or of a matrix.
Deviation scores are in the same units as the original variables, but sum to zero.
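A quick check of the equivalence (note that in current versions of R, sd no longer works directly on a data frame, so apply or the covariance matrix can be used instead):

> diag(var(s.df[,4:7]))          #the diagonal of the covariance matrix reproduces x.var
> sqrt(diag(var(s.df[,4:7])))    #and its square root reproduces x.sd
> apply(s.df[,4:7], 2, sd)       #column by column standard deviations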

3.7.4 Coefficient of variation

Particularly when using values that have the appearance of ratio measurement (e.g., dollars,
reaction times, microliters of a biological assay), an index of how much variation there is
compared to the mean level is the coefficient of variation. This is simply the ratio of the
standard deviation to the sample mean. Although not commonly seen in psychometrics, the
CV will be seen in biological journals (reporting the error of an assay), in financial reports,
and in manufacturing process control settings.

CV = \frac{\sigma_x}{\bar{X}}    (3.9)
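A one-line helper makes this concrete (the function name cv and the reaction time values are hypothetical):

> cv <- function(x, na.rm=TRUE) { sd(x, na.rm=na.rm)/mean(x, na.rm=na.rm) }
> rt <- c(320, 350, 410, 500, 640)    #hypothetical reaction times in milliseconds
> cv(rt)                              #about .29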

3.8 Geometric interpretations of Variance and Covariance

It is sometimes useful to think of data geometrically. A set of n scores on a single variable
may be thought of geometrically as representing a vector in n-dimensional space where the
dimensions of the space represent the individual scores. Center this vector on the grand mean
(i.e., convert the scores to deviation scores). Then the length of this vector is the square root
of the sums of squares and the average length of this vector across all dimensions is the
standard deviation.
Another measure of dispersion is the average squared distance between the n data points.
This is found by finding all $n^2$ pairwise distances, squaring them, and then dividing by $n^2$.
But since the diagonal of that matrix is necessarily zero, it is more appropriate to divide by
n(n-1). This value is, it turns out, twice the variance. Remembering that the standard deviation
is the square root of the variance, we find that the average distance between any two data
points is $\sigma_x\sqrt{2}$.
Why is this? Consider the matrix of distances between pairs of data points:

\begin{pmatrix}
0 & X_1 - X_2 & \dots & X_1 - X_n \\
X_2 - X_1 & 0 & \dots & X_2 - X_n \\
\dots & \dots & 0 & \dots \\
X_n - X_1 & X_n - X_2 & \dots & 0
\end{pmatrix}

Square each element:

\begin{pmatrix}
0 & X_1^2 + X_2^2 - 2X_1X_2 & \dots & X_1^2 + X_n^2 - 2X_1X_n \\
X_1^2 + X_2^2 - 2X_1X_2 & 0 & \dots & X_2^2 + X_n^2 - 2X_2X_n \\
\dots & \dots & 0 & \dots \\
X_1^2 + X_n^2 - 2X_1X_n & X_2^2 + X_n^2 - 2X_2X_n & \dots & 0
\end{pmatrix}.

Sum all of these elements to obtain

\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 = 2n\sum_{i=1}^{n} X_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j    (3.10)

The average squared distance may be obtained by dividing the total squared distance by
$n^2$ (to obtain a population variance) or by n(n-1) to obtain the sample estimate of the
variance.

\bar{d}^2 = 2(\sum_{i=1}^{n} X_i^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j / n)/(n-1)    (3.11)

But this is just the same as

2(\sum_{i=1}^{n} X_i^2 - \sum_{i=1}^{n} X_i X_.)/(n-1) = 2(\sum_{i=1}^{n} X_i^2 - n X_.^2)/(n-1)    (3.12)

which is twice the variance:

\sigma^2 = x^T x/(n-1) = (X - X_.)^T (X - X_.)/(n-1) = \sum_{i=1}^{n}(X_i - X_.)^2/(n-1) = (\sum_{i=1}^{n} X_i^2 - n X_.^2)/(n-1)    (3.13)
That is, the average distance between any two data points will be $\sigma_x\sqrt{2}$. Knowing the
standard deviation allows us to judge not just how likely a point is to deviate from the mean,
but also how likely two points are to differ by a particular amount.
3.9 Variance, Covariance, and Distance

There are a variety of ways to conceptualize variance and covariance. Algebraically, for a
vector X with elements $X_i$, the variance is the average squared distance from the
mean, $X_.$ (Equation 3.13), or alternatively, 1/2 of the average squared distance between any
two points (Equation 3.12). For two vectors, $X_1$ and $X_2$, the covariance between them may
be similarly defined as the average product of deviation scores:

Cov_{12} = \sum_{i=1}^{n} x_{1i}x_{2i}/n = \sum_{i=1}^{n}(X_{1i} - X_{1.})(X_{2i} - X_{2.})/n = \{\sum_{i=1}^{n} X_{1i}X_{2i} - \sum_{i=1}^{n} X_{1i}\sum_{i=1}^{n} X_{2i}/n\}/n.    (3.14)

A spatial interpretation of covariance may be expressed in terms of the average squared distance
between the corresponding points in $X_1$ and $X_2$. For simplicity, express each vector in terms
of deviations from its respective mean: $x_{1i} = X_{1i} - X_{1.}$

dist_{12}^2 = \frac{\sum_{i=1}^{n}(x_{1i} - x_{2i})^2}{n} = \frac{\sum_{i=1}^{n}(x_{1i}^2 + x_{2i}^2 - 2x_{1i}x_{2i})}{n} = Var_1 + Var_2 - 2Cov_{12}    (3.15)
That is, the covariance is the average of the variances of the two vectors (each of which is
just half the average squared distance between pairs of points within that vector) minus half
the average squared distance between the corresponding pairs of points on the two vectors.

Cov_{12} = \frac{Var_1 + Var_2 - dist_{12}^2}{2} = \frac{Var_1 + Var_2}{2} - \frac{dist_{12}^2}{2}.    (3.16)
If each element of $X_1$ is the same as the corresponding element of $X_2$, then the pairwise distances are
zero, the two variances are identical, and the covariance is the same as the variance.
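A quick numerical sketch of Equation 3.15 (with simulated data, and remembering that var and cov use n - 1 while the equation uses n, so the two sides agree only approximately):

> set.seed(7)
> X1 <- rnorm(1000)
> X2 <- X1 + rnorm(1000)                    #a second variable correlated with the first
> x1 <- X1 - mean(X1); x2 <- X2 - mean(X2)  #deviation scores
> mean((x1 - x2)^2)                         #average squared distance between corresponding points
> var(X1) + var(X2) - 2*cov(X1, X2)         #nearly the same value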

3.10 Standard scores as unit free measures

In some fields, the unit of measurement is most important. In economics, a basic unit could
be the dollar or the logarithm of the dollar. In education the basic unit might be years of
schooling. In cognitive psychology the unit might be the millisecond. A tradition in much
of individual differences psychology is to ignore the units of measurement and to convert
deviation scores into standard scores. That is, to divide deviation scores by the standard
deviation:

z_i = x_i/\sigma_x = (X_i - X_.)/\sqrt{Var_X}    (3.17)
One particularly attractive feature of standard scores is that they have mean of 0 and
standard deviation and variance of 1. This makes some derivations easier to do because
variances or standard deviations drop out of the equations. A disadvantage of standard scores
is communicating the scores to lay people. To be told that someone’s son or daughter has a
score of -1 is particularly discouraging. To avoid this problem (and to avoid the problem of
decimals and negative numbers in general) a number of transformations of standard scores
are used when communicating to the public. They are all of the form of multiplying the zi
scores by a constant and then adding a different constant (Table 3.11). The rescale function
does this by using the scale function to first convert the data to z scores, and then multiplies
by the desired standard deviation and adds the desired mean (see Figure 3.8).

Table 3.11 Raw scores (Xi ) are typically converted into deviation scores (xi ) or standard scores (zi ).
These are, in turn, transformed into “public” scores for communication to laypeople.

Transformation                           Mean                     Standard Deviation
Raw Data         $X_i$                   $X_. = \sum X_i/n$       $s_x = \sqrt{\sum(X_i - X_.)^2/(n-1)}$
deviation score  $x_i = X_i - X_.$       0                        $s_x = \sqrt{\sum x_i^2/(n-1)}$
standard score   $z_i = x_i/s_x$         0                        1
“IQ”             $z_i \times 15 + 100$   100                      15
“SAT”            $z_i \times 100 + 500$  500                      100
“ACT”            $z_i \times 6 + 18$     18                       6
“T-score”        $z_i \times 10 + 50$    50                       10
“Stanine”        $z_i \times 2.0 + 5$    5                        2.0
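A minimal sketch of these transformations using the scale function (the raw scores here are hypothetical; the rescale function in psych wraps the same two steps):

> raw <- rnorm(100, mean = 25, sd = 5)   #hypothetical raw scores
> z <- scale(raw)                        #standard scores: mean 0, standard deviation 1
> IQ <- z*15 + 100                       #the “IQ” metric of Table 3.11
> SAT <- z*100 + 500                     #the “SAT” metric
> round(c(mean(IQ), sd(IQ)))             #100 and 15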

3.11 Assessing the higher order moments of the normal and other
distributions

The central limit theorem shows that the distribution of the means of independent identically
distributed samples with finite means and variances will tend asymptotically towards the
normal distribution originally described by DeMoivre in 1733, by Laplace in 1774, and by Gauss
in 1809, and named by Galton (1877) and others, as discussed by Stigler (1986). The equation
for the normal curve expressed in terms of the mean and standard deviation is
f(x, \mu, \sigma) = N(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.    (3.18)
Three normal curves, differing in their mean and standard deviation (i.e., N(0, 1), N(0, 2),
and N(1, 2)) are shown in Figure 3.7. Although typically shown in terms of N(0, 1), alternative
scalings of the normal seen in psychology and psychometrics have different values of the mean
and standard deviation (Table 3.11 and Figure 3.8) partly in order to facilitate communication
with non-statisticians, and partly to obscure the meaning of the scores.


Fig. 3.7 Normal curves can differ in their location (mean) as well as width (standard deviation).
Shown are normals with means of 0 or 1 and standard deviations of 1 or 2.

In addition to its mathematical simplicity, the normal distribution is seen in many settings
where the accumulation of errors is random (e.g., astronomical observations) or made up of
many small sources of variance (the distribution of height among Belgian soldiers as described
by Quetelet in 1837 (Stigler, 1999)). Unfortunately, real data rarely are so easily described.
Karl Pearson (1905) made this distinction quite clearly:
The chief physical differences between actual frequency distributions and the Gaussian theoretical
distributions are:

Fig. 3.8 The normal curve may be expressed in standard (z) units with a mean of 0 and a standard
deviation of 1. Alternative scalings of the normal include “percentiles” (a non linear transformation
of the z scores), “IQ” scores with a mean of 100, and a standard deviation of 15, “SAT/GRE” scores
with a mean of 500 and a standard deviation of 100, “ACT” scores with a mean of 18 and a standard
deviation of 6, or “standardized nines - stanines” with a mean of 5 and a standard deviation of 2. Note
that for stanines, each separate score refer to the range from -.5 to +.5 from that score. Thus, the 9th
stanine includes the z-score region from 1.75z and above and has 4% of the normal population. The
5 numbers of the box plot correspond to the lower whisker, 25th, 50th and 75th percentiles, and the
upper whisker.

(i) The significant separation between the mode of position of maximum frequency and the
average or mean character.
(ii) The ratio of this separation between mean and mode to the variability of the character–a
quantity I have termed the skewness.
(iii) A degree of flat-toppedness which is greater or less than that of the normal curve. Given two
frequency distributions which have the same variability as measured by the standard deviation,
they may be relatively more or less flat-topped than the normal curve. If more flat-topped I
term them platykurtic, if less flat-topped leptokurtic, and if equally flat-topped mesokurtic. A
frequency distribution may be symmetrical, satisfying both the first two conditions for normality,
but it may fail to be mesokurtic, and thus the Gaussian curve cannot describe it. (Pearson, 1905,
p 173).
Just as the variance is the second moment around the mean and describes the width of
the distribution, so does the skew (the third moment) describe the asymmetry of the distribution,
and the kurtosis (the fourth moment) its peakedness versus flatness (Pearson, 1905).

skew = \gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\sqrt{n}\,\sum_{i=1}^{n} x_i^3}{(\sum_{i=1}^{n} x_i^2)^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(X_i - X_.)^3}{(\sum_{i=1}^{n}(X_i - X_.)^2)^{3/2}}    (3.19)

The standard error of the skew is

\sigma_{\gamma_1} = \sqrt{\frac{6}{N}}    (3.20)
Distributions with positive skew have long right tails while those with negative skew have
long left tails. Examples of positively skewed distributions are common in psychological measures
such as reaction time (Ratcliff, 1993) or measures of negative affect (Rafaeli and Revelle,
2006). As we shall see later (Chapter 4), differences in skew are particularly important in
the effect they have on correlations. Positively skewed reaction time data are sometimes
modeled as log normal distributions or sometimes as Weibull distributions. Just as the normal
represents the sum of Independently and Identically Distributed random variables (IIDs),
so does the log normal represent the product of IIDs. One positively skewed distribution
that is commonly seen in economics is the log normal distribution, which can reflect a normal
distribution of multiplicative growth rates (Figure 3.9) and is seen in the distribution
of income in the United States. That is, if the percentage raise given employees is normally
distributed, the resulting income distribution after several years of such raises will be log
normal. Cognitive processes operating in a cascade can also be thought of in terms of the
log normal distribution. Estimating the central tendency of skewed distributions is particularly
problematic, for the various estimates discussed earlier will differ drastically. Consider
the curve generated using the dlnorm function set with a log mean of 10.8 and a log sd of .8.
These values were chosen to give a rough example of the distribution of family income in
the US, which in 2008 had a median of $50,302, a mean of $68,204 and a trimmed mean of
$56,720. (See the income data set for the data). An even more drastic curve is the power law
($f(n) = K/n^a$) summarizing the distribution of publications of Ph.D.s, with a mode of 0 and
an upper range in the 1,000s (Anderson et al., 2008; Lotka, 1926; Vinkler, 2007).
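A sketch of such a simulation, using the parameters of the curve call given in the caption of Figure 3.9 (the seed is arbitrary, and because the data are random the summary values will only approximate those reported there):

> set.seed(42)
> income <- rlnorm(10000, meanlog = 10.8, sdlog = .8)   #simulated family incomes
> round(c(median = median(income), trimmed = mean(income, trim = .2), mean = mean(income)))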
Platykurtic distributions (kurtosis < 0) have less of their density in their tails than would
be expected given the magnitude of their standard deviations. Leptokurtic distributions
(kurtosis > 0), on the other hand, have “fatter tails” than would be expected given
their standard deviation (Pearson, 1905). (“Student” introduced the mnemonic that a platypus
has a short tail and that kangaroos, who are known for “lepping”, have long tails (Student, 1927).)

kurtosis = \gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = \frac{n\sum_{i=1}^{n} x_i^4}{(\sum_{i=1}^{n} x_i^2)^2} - 3 = \frac{n\sum_{i=1}^{n}(X_i - X_.)^4}{(\sum_{i=1}^{n}(X_i - X_.)^2)^2} - 3    (3.21)
Given the standard error of the skew (Equation 3.20) and the standard error of the kurtosis
(Equation 3.22), it is possible to test whether a particular distribution has excess skew or
kurtosis.

\sigma_{\gamma_2} = \sqrt{\frac{24}{N}}    (3.22)
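A minimal sketch of these sample estimates (the helper names my.skew and my.kurtosis are illustrative; the describe function used later in this chapter reports comparable values):

> my.skew <- function(x) {x <- x - mean(x); sqrt(length(x))*sum(x^3)/(sum(x^2))^(3/2)}
> my.kurtosis <- function(x) {x <- x - mean(x); length(x)*sum(x^4)/(sum(x^2))^2 - 3}
> x <- rexp(10000)     #an exponential distribution has skew 2 and excess kurtosis 6
> my.skew(x)           #should be close to 2
> my.kurtosis(x)       #should be close to 6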
Although it is frequently reported that in a positively skewed distribution the mode will be
less than the median, which will be less than the mean (e.g., Figure 3.9), this is not always
the case. von Hippel (2005) discusses a number of counter examples.

Fig. 3.9 A log normal distribution is skewed to the right and represents the distribution of normally
distributed multiplicative processes. An example of US. family income distributions adapted from the
US Census (2008) is shown with a mean of $66,570, trimmed mean of $55,590, and median of $48,060.
Means, the median, skew and kurtosis were found from simulating 10,000 cases from the log normal
with a log mean of 10.8 and log sd of .8. Curve drawn with the curve function plotting the dlnorm function
with a log mean of 10.8 and log sd of .8: curve(dlnorm(x, 10.8, .8), x = c(0,250000)). The top panel shows the
modeled data, the lower panel, the actual data. Values for income above 100,000 are inferred from the
census data categories of 100-150, 150-200, 200-250 and fit with a negative exponential. See income
for the US census data set on family income. The smooth curve for the numbers less than 100,000 is
generated using the lowess function. The sawtooth alternation of the actual data suggests that people
are reporting their income to the nearest $5,000.

It is helpful to consider the distributions generated from several different families of dis-
tributions to realize that just because a distribution is symmetric and peaks in the middle
does not tell us much about the length of the tails. Consider the four distributions shown in
Figure 3.10. The top and bottom curves are normal, one with standard deviation 1, one with
standard deviation 2. Both of these have 0 skew and 0 kurtosis. However the other two, the
logistic and the Cauchy are definitely not normal. In fact, the Cauchy has infinite variance
and kurtosis!
The Cauchy distribution is frequently used as a counter example to those who want to
generalize the central limit theorem to all distributions, for the means of observations from
the Cauchy distribution are not distributed normally, but rather remain distributed as before.
The distribution is sometimes referred to as the “witch of Agnesi” (Stigler, 1999). The function
is

f(x) = \frac{1}{\pi(1 + x^2)}    (3.23)


Fig. 3.10 Symmetric and single peaked is not the same as being a normal distribution. Two of these
distributions are normal, differing only in their standard deviations, one, the logistic has slightly more
kurtosis, and one (the Cauchy) has infinite variance and kurtosis.

3.12 Generating commonly observed distributions

Many statistics books include tables of the t or F or χ² distribution. By using R this is
unnecessary since these and many more distributions can be obtained directly. Consider the
normal distribution as an example. dnorm(x, mean=mu, sd=sigma) will give the probability
density of observing that x in a distribution with mean=mu and standard deviation= sigma.
pnorm(q,mean=0,sd=1) will give the probability of observing the value q or less. qnorm(p,
mean=0, sd=1) will give the quantile value of a value with probability p. rnorm(n,mean,sd)
will generate n random observations sampled from the normal distribution with specified
mean and standard deviation. Thus, to find out what z value has a .05 probability we ask for
qnorm(.05). Or, to evaluate the probability of observing a z value of 2.5, specify pnorm(2.5).
(These last two examples are one-sided p values.)
Applying these prefixes (d,p,q, r) to the various distributions available in R allows us to
evaluate or simulate many different distributions (Table 3.12).
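For example (the exact random values from rnorm will of course differ from run to run):

> qnorm(.05)                    #the z value with 5% of the distribution below it (about -1.64)
> pnorm(2.5)                    #the probability of observing a z of 2.5 or less (about .994)
> dnorm(0)                      #the density of the standard normal at its mean (about .399)
> rnorm(5, mean = 100, sd = 15) #five random observations on an “IQ”-like scale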

Table 3.12 Some of the most useful distributions for psychometrics that are available as functions.
To obtain the density, prefix with d, probability with p, quantiles with q and to generate random
values with r. (e.g., the normal distribution may be chosen by using dnorm, pnorm, qnorm, or rnorm.)
Each function has specific parameters, some of which take default values, some of which require being
specified. Use help for each function for details.

Distribution base name Parameter 1 Parameter 2 Parameter 3 example application


Normal norm mean sigma Most data
Multivariate normal mvnorm mean r sigma Most data
Log Normal lnorm log mean log sigma income or reaction time
Uniform unif min max rectangular distributions
Binomial binom size prob Bernoulli trials (e.g. coin flips)
Student’s t t df non-centrality Finding significance of a t-test
Multivariate t mvt df corr non-centrality Multivariate applications
Fisher’s F f df1 df2 non-centrality Testing for significance of F test
χ2 chisq df non-centrality Testing for significance of χ 2
Beta beta shape1 shape2 non-centrality distribution theory
Cauchy cauchy location scale Infinite variance distribution
Exponential exp rate Exponential decay
Gamma gamma shape rate scale distribution theory
Hypergeometric hyper m n k
Logistic logis location scale Item Response Theory
Poisson pois lambda Count data
Weibull weibull shape scale Reaction time distributions

3.13 Mixed distributions

The standard deviation and its associated transformations are useful if the data are normal.
But what if they are not? Consider the case of participants assessed on some measure. 90%
of these participants are sampled from a normal population with a mean of 0 and a standard
deviation (and variance) of 1. But if 10% of the participants are sampled from a population
with the same mean but 100 times as much variance (sd = 10), the pooled variance of the
sample will be .9 * 1 + .1 * 100 or 10.9 and the standard deviation will be 3.3. Although
it would seem obvious that these two distributions would appear to be very different, this
is not the case (see Figure 3.11 and Table 3.13). As discussed by Wilcox (2005), even if the
contaminated distribution is a mixture of 90% N(0,1) and 10% N(3,40), the plots of the
uncontaminated and contaminated distributions look very similar.


Fig. 3.11 Probability density distributions for a normal distribution with mean 0 and standard de-
viation 1, a contaminated distribution (dashed line) formed from combining a N(0,1) with a N(3,10),
and a normal with the same mean and standard deviation as the contaminated N(.3,3.3) (dotted line).
Adapted from Wilcox, 2005.

3.14 Robust measures of dispersion

Estimates of central tendency and of dispersion that are less sensitive to contamination
and outliers are said to be robust estimators. Just as the median and trimmed mean are
less sensitive to contamination, so is the median absolute difference from the median (mad).
Consider the following seven data sets: The first one (x) is simply a normal distribution with
mean 0 and sd of 1. Noise10, Noise20, and Noise40 are normals with means of 3 and standard
deviations of 10, 20, and 40 respectively. Mixed10 is made up of a mixture of 90% sampled
from x and 10% sampled from Noise10. Mixed20 has the same sampling frequencies, but noise
is sampled from Noise20. Similarly for Mixed40. X and the noise samples are created using
the rnorm function to create random data with a normal distribution with a specified mean
and standard deviation. The mixtures are formed by combining (using c) random samples
(using sample) of the X and noise distributions. Descriptive statistics are found by describe.
The first four variables (X, Noise10, Noise20, and Noise40) are normally distributed, and
the means, trimmed means, and medians are almost identical. Similarly, the standard devi-
ations and median absolute deviations from the medians (MAD) are almost identical. But
this is not the case for the contaminated scores. Although the simple arithmetic means of
the mixed distributions reflect the contamination, the trimmed means (trimmed by dropping
the top and bottom 10%) and medians are very close to that of the uncontaminated distri-
bution. Similarly the MAD of the contaminated scores are barely affected (1.14 versus .99)
even though the standard deviations are drastically larger (12.58 versus 1.0).

Table 3.13 Generating distributions of normal and contaminated normal data. A normal distribution
with N(0,1) is 10% contaminated with N(3,10), N(3,20) or N(3,40). Although these mixtures are formed
from two normals, they are not normal, but rather have very heavy tails. Observe how the median
and trimmed mean are not affected by the contamination. Figure 3.11 shows a plot of the probability
density of the original and the mixed10 contaminated distribution. The contamination may be detected
by examining the difference between the standard deviation and the median absolute deviation from
the median or the kurtosis.

> n <- 10000
> frac <- .1
> m <- 3
> x <- rnorm(n)
> noise10 <- rnorm(n,m,sd=10)
> mixed10 <- c(sample(x,n * (1-frac),replace=TRUE),sample(noise10,n*frac,replace=TRUE))
> dmixed <- density(mixed10,bw=.3,kernel="gaussian")   #density of the contaminated distribution (plotted in Figure 3.11)
> noise20 <- rnorm(n,m,sd=20)
> noise40 <- rnorm(n,m,sd=40)
> mixed20 <- c(sample(x,n * (1-frac),replace=TRUE),sample(noise20,n*frac,replace=TRUE))
> mixed40 <- c(sample(x,n * (1-frac),replace=TRUE),sample(noise40,n*frac,replace=TRUE))
> data.df <- data.frame(x,noise10,noise20,noise40,mixed10,mixed20,mixed40)
> describe(data.df)

var n mean sd median trimmed mad min max range skew kurtosis se
x 1 10000 0.01 1.00 0.01 0.01 0.99 -4.28 3.71 7.99 0.02 -0.04 0.01
noise10 2 10000 2.89 9.91 2.94 2.89 9.95 -32.96 46.39 79.35 0.01 0.02 0.10
noise20 3 10000 3.14 20.21 3.10 3.12 20.25 -74.10 85.71 159.81 0.01 -0.02 0.20
noise40 4 10000 3.49 40.41 3.05 3.23 40.51 -149.29 169.40 318.69 0.06 -0.07 0.40
mixed10 5 10000 0.31 3.38 0.05 0.06 1.10 -30.01 46.39 76.40 2.29 24.49 0.03
mixed20 6 10000 0.27 6.41 0.02 0.02 1.14 -66.22 64.16 130.38 1.16 26.01 0.06
mixed40 7 10000 0.25 12.58 0.00 0.00 1.14 -117.71 169.40 287.11 1.01 27.64 0.13

Robust estimates of central tendency and of dispersion are important to consider when
estimating experimental effects, for although conventional tests such as the t-test and F-test
are not overly sensitive to Type I errors when the distributions are not normal, they are very
sensitive to Type II errors. That is to say, if the data are not normal due to contamination as
seen in Figure 3.11, true differences of central tendencies will not be detected by conventional
tests (Wilcox, 1987; Wilcox and Keselman, 2003; Wilcox, 2005). Robust estimates of central
tendency and robust equivalents of the t and F tests are slightly less powerful when the data
are truly normal, but much more powerful in cases of non-normality (Wilcox and Keselman,
2003; Wilcox, 2005). Functions to do robust analysis are available in multiple packages,
including MASS, robust, and robustbase as well as from the web pages of various investigators
(e.g. Rand Wilcox at the University of Southern California).

3.15 Monotonic transformations of data and “Tukey’s ladder”

If the data are non-normal or if the relations are non-linear, what should we do? John Tukey
(1977) suggested a ladder of transformations that can be applied to the data (Table 3.14,
Figure 3.12). These transformations have the effect of emphasizing different aspects of the
data. If the data are skewed heavily to the right (e.g., for reaction times or incomes), taking
logs or reciprocals deemphasizes the largest numbers and makes distinctions between the
smaller numbers easier to see. Similarly, taking squares or cubes of the data can make some
relationships much clearer. Consider the advantage of treating distance travelled as a func-
tion of squared time when study the effects of acceleration. Similarly, when examining the
damaging effects of wind intensity upon houses, squaring the wind velocity leads to a better
understanding of the effects. The appropriate use of the ladder of transformations is look at
the data, look at the distributions of the data, and then look at the bivariate plots of the
data. Try alternative transforms until these exploratory plots look better. Data analysis is
detective work and requires considering many alternative hypotheses about the best way to
treat the data.
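As a small illustration (my own, not from the text), strongly positively skewed data such as simulated reaction times become far less skewed after moving down the ladder to square roots, logs, or reciprocals; the skew column of describe shows the change directly.

> library(psych)
> set.seed(17)
> rt <- rlnorm(1000, meanlog=6, sdlog=.5)    #positively skewed "reaction times" in milliseconds
> describe(data.frame(raw=rt, root=sqrt(rt), log=log(rt), reciprocal=-1000/rt))  #compare the skew column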

Table 3.14 Tukey’s ladder of transformations. One goes up and down the ladder until the relationships
desired are roughly linear or the distribution is less skewed. The effect of taking powers of the numbers
is to emphasize the larger numbers; the effect of taking roots, logs, or reciprocals is to emphasize the
smaller numbers.

Transformation  Effect
x^3             emphasize large numbers, reduce negative skew
x^2             emphasize large numbers, reduce negative skew
x               the basic data
sqrt(x)         emphasize smaller numbers, reduce positive skew
-1/x            emphasize smaller numbers, reduce positive skew
log(x)          emphasize smaller numbers, reduce positive skew
-1/x^2          emphasize smaller numbers, reduce positive skew
-1/x^3          emphasize smaller numbers, reduce positive skew


3.16 What is the fundamental scale?

It would be nice to be able to answer this question with a simple statement, but the answer
is really that it all depends. It depends upon what is being measured and what inferences
we are trying to draw from the data. We have recognized for centuries that money, whether
expressed in dollars, ducats, Euros, Renminbi, or Yen, is measured on a linear, ratio scale
but has a negatively accelerated effect upon happiness (Bernoulli, 1738). (That is, the utility

[Figure 3.12 appears here: “Tukey's ladder of transformations”, plotting the transformed values (x^3, x^2, x, sqrt(x), -1/x, log(x), -1/x^2, -1/x^3) against the original values.]

Fig. 3.12 Tukey (1977) suggested a number of transformations of data that allow relationships to be
seen more easily. Ranging from the cube to the reciprocal of the cube, these transformations emphasize
different parts of the distribution.

of money is negatively accelerated.) The perceived intensity of a stimulus is a logarithmic


function of the physical intensity (Weber, 1834b). The probability of giving a correct answer
on a test is an increasing but non-linear function of the normal way we think of ability
(Embretson and Hershberger, 1999; McDonald, 1999). The amount of energy used to heat
a house is a negative but linear function of the outside temperature. The time it takes to
fall a particular distance is a function of the square root of that distance. The gravitational
attraction between two masses is a function of the inverse of the squared distance. The hull
speed of a sailboat is a function of the square root of the length of the boat. Sound intensity
in decibels is expressed in logarithmic units of the ratio of the power of the observed sound
to the power of a reference sound. The units of the pH scale in chemistry are (negative)
logarithmic units of the concentration of hydrogen ions.
The conclusion from these examples is that the appropriate scale is one that makes the
relationships between our observed variables and manipulations easier to understand and to
communicate. The scales of our observed variables are reflections of the values of our latent
variables (Figure 3.1) and are most useful when they allow us to simplify our inferences about
the relationships between the latent variables. By not considering the scaling properties
of our observations it is easy to draw incorrect conclusions about the underlying processes
(consider the example discussed in section 3.6). By searching for the transformations that
allow us to best represent the data, perhaps we are able to better understand the latent
processes involved.
Chapter 4
Covariance, Regression, and Correlation

“Co-relation or correlation of structure” is a phrase much used in biology, and not least in that
branch of it which refers to heredity, and the idea is even more frequently present than the
phrase; but I am not aware of any previous attempt to define it clearly, to trace its mode of
action in detail, or to show how to measure its degree. (Galton, 1888, p 135)

A fundamental question in science is how to measure the relationship between two vari-
ables. The answer, developed in the late 19th century in the form of the correlation
coefficient, is arguably the most important contribution to psychological theory and method-
ology in the past two centuries. Whether we are examining the effect of education upon
later income, of parental height upon the height of offspring, or the likelihood of graduating
from college as a function of SAT score, the question remains the same: what is the strength
of the relationship? This chapter examines measures of relationship between two variables.
Generalizations to the problem of how to measure the relationships between sets of variables
(multiple correlation and multiple regression) are left to Chapter 5.
In the mid 19th century, the British polymath, Sir Francis Galton, became interested
in the intergenerational similarity of physical and psychological traits. In his original study
developing the correlation coefficient Galton (1877) examined how the size of a sweet pea
depended upon the size of the parent seed. These data are available in the psych package
as peas. In subsequent studies he examined the relationship between the average height of
mothers and fathers with those of their offspring (Galton, 1886) as well as the relationship
between the length of various body parts and height (Galton, 1888). Galton’s data are avail-
able in the psych package as galton and cubits (Table 4.1).1 To order the table to match
the appearance in Galton (1886), we need to order the rows in decreasing order. Because
the rownames are characters, we first convert them to ranks.
Examining the table it is clear that as the average height of the parents increases, there is a
corresponding increase in the height of the child. But how to summarize this relationship? The
immediate solution is graphic (Figure 4.1). This figure differs from the original data in that
the data are randomly jittered a small amount using jitter to separate points at the same
location. Using the interp.qplot.by function to show the interpolated medians as well as the
first and third quartiles, the medians of child heights are plotted against the middle of their
parent’s heights. Using a smoothing technique he had developed to plot meteorological data
Galton (1886) proceeded to estimate error ellipses as well as slopes through the smoothed

1 For galton, see also UsingR.


Table 4.1 The relationship between the average of both parents (mid parent) and the height of their
children. The basic data table is from Galton (1886) who used these data to introduce reversion to the
mean (and thus, linear regression). The data are available as part of the UsingR or psych packages. See
also Figures 4.1 and 4.2.

> library(psych)
> data(galton)
> galton.tab <- table(galton)
> galton.tab[order(rank(rownames(galton.tab)),decreasing=TRUE),] #sort it by decreasing row values

child
parent 61.7 62.2 63.2 64.2 65.2 66.2 67.2 68.2 69.2 70.2 71.2 72.2 73.2 73.7
73 0 0 0 0 0 0 0 0 0 0 0 1 3 0
72.5 0 0 0 0 0 0 0 1 2 1 2 7 2 4
71.5 0 0 0 0 1 3 4 3 5 10 4 9 2 2
70.5 1 0 1 0 1 1 3 12 18 14 7 4 3 3
69.5 0 0 1 16 4 17 27 20 33 25 20 11 4 5
68.5 1 0 7 11 16 25 31 34 48 21 18 4 3 0
67.5 0 3 5 14 15 36 38 28 38 19 11 4 0 0
66.5 0 3 3 5 2 17 17 14 13 4 0 0 0 0
65.5 1 0 9 5 7 11 11 7 7 5 2 1 0 0
64.5 1 1 4 4 1 5 5 0 2 0 0 0 0 0
64 1 0 2 4 1 2 2 1 1 0 0 0 0 0

medians. When this is done, it is quite clear that a line goes through most of the medians,
with the exception of the two highest values.2
A finding that is quite clear is that there is a “reversion to mediocrity” (Galton, 1877,
1886). That is, parents above or below the median tend to have children who are closer to
the median (reverting to mediocrity) than they. But this reversion is true in either direction,
for children who are exceptionally tall tend to have parents who are closer to the median
than they. Now known as regression to the mean, misunderstanding this basic statistical
phenomenon has continued to lead to confusion for the past century (Stigler, 1999). To show
that regression works in both directions, Galton’s data are also plotted for child regressed on
mid parent (left hand panel) or the middle parent height regressed on the child heights (right
hand panel of Figure 4.2).
Galton’s solution for finding the slope of the line was graphical, although his measure of
reversion, r, was expressed as a reduction in variation. Karl Pearson, who referred to “Galton’s
function”, later gave Galton credit for developing the equation we now know as the Pearson
Product Moment Correlation Coefficient (Pearson, 1895, 1920).
Galton recognized that the prediction equation for the best estimate of Y, Ŷ , is merely
the solution to the linear equation
Ŷ = by.x X + c (4.1)
which, when expressed in deviations from the mean of X and Y, becomes

ŷ = by.x x. (4.2)

2 As discussed by Wachsmuth et al. (2003), this bend in the plot is probably due to the way Galton
combined male and female heights.

[Figure 4.1 appears here: “Galton's regression” — a jittered scatter plot of Child Height against Mid Parent Height, with interpolated quartile bars, the best fitting line, and one and two standard deviation ellipses.]

Fig. 4.1 The data in Table 4.1 can be plotted to show the relationships between mid parent and child
heights. Because the original data are grouped, the data points have been jittered to emphasize the
density of points along the median. The bars connect the first, 2nd (median) and third quartiles. The
dashed line is the best fitting linear fit, the ellipses represent one and two standard deviations from
the mean.

The question becomes one of what slope best predicts Y or y. If we let the residual of
prediction be $e = y - \hat{y}$, then $V_e$, the average squared residual $\sum_{i=1}^{n} e^2/n$, will be a quadratic
function of $b_{y.x}$:

$$V_e = \sum_{i=1}^{n} e^2/n = \sum_{i=1}^{n} (y - \hat{y})^2/n = \sum_{i=1}^{n} (y - b_{y.x}x)^2/n = \sum_{i=1}^{n} (y^2 - 2b_{y.x}xy + b_{y.x}^2x^2)/n \tag{4.3}$$

$V_e$ is minimized when the first derivative of equation 4.3 with respect to $b_{y.x}$ is set to 0:

$$\frac{d(V_e)}{d(b)} = \sum_{i=1}^{n} (2b_{y.x}x^2 - 2xy)/n = 2b_{y.x}\sigma_x^2 - 2Cov_{xy} = 0 \tag{4.4}$$

which implies that

$$b_{y.x} = \frac{Cov_{xy}}{\sigma_x^2}. \tag{4.5}$$

That is, $b_{y.x}$, the slope of the line predicting y given x that minimizes the squared residual
(also known as the squared error of prediction), is the ratio of the covariance between x and
y to the variance of X. Similarly, the slope of the line that best predicts x given values of
y will be

$$b_{x.y} = \frac{Cov_{xy}}{\sigma_y^2}. \tag{4.6}$$

[Figure 4.2 appears here: two panels, “Child ~ mid parent” (slope b = 0.65) and “Mid parent ~ child” (slope b = 0.33), each showing the jittered scatter plot, the interpolated quartiles, and the best fitting line.]

Fig. 4.2 Galton (1886) examined the relationship between the average height of parents and their
children. He corrected for sex differences in height by multiplying the female scores by 1.08, and then
found the average of the parents (the mid parent). Two plots are shown. The left hand panel shows
child height varying as the mid parent height. The right hand panel shows mid parent height varying
as child height. For both panels, the vertical lines and bars show the first, second (the median), and
third interpolated quartiles. The slopes of the best fitting lines are given (see Table 4.2). Galton was
aware of this difference in slopes and suggested that one should convert the variability of both variables
to standard units by dividing the deviation scores by the inter-quartile range. The non-linearity in the
medians for heights about 72 inches is discussed by Wachsmuth et al. (2003)

As an example, consider the galton data set, where the variances and covariances are found
by the cov function and the slopes may be found by using the linear model function lm
(Table 4.2). There are, of course, two slopes: one for the best fitting line predicting the height
of the children given the average (mid) of the two parents and the other is for predicting the
average height of the parents given the height of their children. As reported by Galton, the
first has a slope of .65, the second a slope of .33. Figure 4.2 shows these two regressions and
plots the median and first and third quartiles for each category of height for either the parents
(the left hand panel) or the children (the right hand panel). It should be noted how well the
linear regression fits the median plots, except for the two highest values. This non-linearity
is probably due to the way that Galton pooled the heights of his male and female subjects
(Wachsmuth et al., 2003).
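The two slopes may also be computed by hand from the variance/covariance matrix, as a check on the lm results in Table 4.2 (this short snippet is my own addition):

> library(psych)
> data(galton)
> C <- cov(galton)
> C["parent","child"]/C["parent","parent"]   #slope for child ~ parent, about .65
> C["parent","child"]/C["child","child"]     #slope for parent ~ child, about .33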

4.1 Correlation as the geometric mean of regressions

Galton’s insight was that if both x and y were on the same scale with equal variability, then
the slope of the line was the same for both predictors and was a measure of the strength of their
relationship. Galton (1886) converted all deviations to the same metric by dividing through

Table 4.2 The variance/covariance matrix of a data matrix or data frame may be found by using the
cov function. The diagonal elements are variances, the off diagonal elements are covariances. Linear
modeling using the lm function finds the best fitting straight line and cor finds the correlation. All three
functions are applied to the Galton dataset galton of mid parent and child heights. As was expected
by Galton (1877), the variance of the mid parents is about half the variance of the children, and the slope
predicting child as a function of mid parent is much steeper than that of predicting mid parent from
child. The cor function finds the covariance of the standardized scores, that is, the correlation.

> data(galton)
> cov(galton)
> lm(child~parent,data=galton)
> lm(parent~child,data=galton)
> round(cor(galton),2)

parent child
parent 3.194561 2.064614
child 2.064614 6.340029

Call:
lm(formula = child ~ parent, data = galton)

Coefficients:
(Intercept) parent
23.9415 0.6463

Call:
lm(formula = parent ~ child, data = galton)

Coefficients:
(Intercept) child
46.1353 0.3256

parent child
parent 1.00 0.46
child 0.46 1.00

by half the interquartile range, and Pearson (1896) modified this by converting the numbers
to standard scores (i.e., dividing the deviations by the standard deviation). Alternatively, the
geometric mean of the two slopes ($b_{x.y}$ and $b_{y.x}$) leads to the same outcome:

$$r_{xy} = \sqrt{b_{x.y}b_{y.x}} = \sqrt{\frac{Cov_{xy}Cov_{yx}}{\sigma_x^2\sigma_y^2}} = \frac{Cov_{xy}}{\sigma_x\sigma_y} \tag{4.7}$$

which is the same as the covariance of the standardized scores of X and Y,

$$r_{xy} = Cov_{z_xz_y} = Cov_{\frac{x}{\sigma_x}\frac{y}{\sigma_y}} = \frac{Cov_{xy}}{\sigma_x\sigma_y}. \tag{4.8}$$

In honor of Karl Pearson (1896), equation 4.8, which expresses the correlation as the product
of the two standardized deviation scores, or the ratio of the moment of dynamics to the
square root of the product of the moments of inertia, is known as the Pearson Product Mo-
ment Correlation Coefficient. Pearson (1895, 1920), however, gave credit for the correlation
coefficient to Galton (1877) and used r as the symbol for correlation in honor of Galton’s
function or the coefficient of reversion. Correlation is done in R using the cor function, as
well as rcorr in the Hmisc package. Tests of significance (see section 4.4.1) are done us-
ing cor.test. Graphic representations of correlations that include locally smoothed linear
fits (lowess regressions) are shown in the pairs or in the pairs.panels functions. For the
galton data set, the correlation is .46 (Table 4.2).
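A quick numerical check (my own addition) of Equation 4.7 is that the square root of the product of the two regression slopes reproduces the correlation of .46 reported in Table 4.2:

> library(psych)
> data(galton)
> b1 <- coef(lm(child ~ parent, data=galton))["parent"]  #slope predicting child from mid parent
> b2 <- coef(lm(parent ~ child, data=galton))["child"]   #slope predicting mid parent from child
> sqrt(b1 * b2)                        #the geometric mean of the two slopes
> cor(galton$parent, galton$child)     #matches the Pearson correlation of .46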

Fig. 4.3 Scatter plots of matrices (SPLOMs) are very useful ways of showing the strength of relation-
ships graphically. Combined with locally smoothed regression lines (lowess), histograms and density
curves, and the correlation coefficient, SPLOMs are very useful exploratory summaries. The data are
from the sat.act data set in psych.

[Figure 4.3 appears here: a SPLOM of the sat.act variables age, ACT, SATV, and SATQ, with histograms and density curves on the diagonal, scatter plots with lowess fits below the diagonal, and the correlations (e.g., ACT with SATV = .56, ACT with SATQ = .59, SATV with SATQ = .64) above the diagonal.]

4.2 Regression and prediction

The slope by.x was found so that it minimizes the sum of the squared residuals, but what is
it? That is, how big is the variance of the residual? Substituting the value of by.x found in
Eq 4.5 into Eq 4.3 leads to
$$V_r = \sum_{i=1}^{n} e^2/n = \sum_{i=1}^{n} (y - \hat{y})^2/n = \sum_{i=1}^{n} (y - b_{y.x}x)^2/n = \sum_{i=1}^{n} (y^2 + b_{y.x}^2x^2 - 2b_{y.x}xy)/n$$

$$V_r = V_y + b_{y.x}^2V_x - 2b_{y.x}Cov_{xy} = V_y + \frac{Cov_{xy}^2}{V_x^2}V_x - 2\frac{Cov_{xy}}{V_x}Cov_{xy}$$

$$V_r = V_y + \frac{Cov_{xy}^2}{V_x} - 2\frac{Cov_{xy}^2}{V_x} = V_y - \frac{Cov_{xy}^2}{V_x}$$

$$V_r = V_y - r_{xy}^2V_y = V_y(1 - r_{xy}^2) \tag{4.9}$$
That is, the variance of the residual in Y or the variance of the error of prediction of Y is
the product of the original variance of Y and one minus the squared correlation between X
and Y. This leads to the following table of relationships:

Table 4.3 The basic relationships between Variance, Covariance, Correlation and Residuals

              Variance        Covariance with X    Covariance with Y   Correlation with X   Correlation with Y
X             Vx              Vx                   Cxy                 1                    rxy
Y             Vy              Cxy                  Vy                  rxy                  1
Ŷ             rxy²Vy          Cxy = rxy σx σy      rxy Vy              1                    rxy
Yr = Y − Ŷ    (1 − rxy²)Vy    0                    (1 − rxy²)Vy        0                    √(1 − rxy²)
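Equation 4.9 is easy to verify numerically; the following check with the galton data is my own addition:

> library(psych)
> data(galton)
> r <- cor(galton$parent, galton$child)
> fit <- lm(child ~ parent, data=galton)
> var(residuals(fit))             #the variance of the residuals
> var(galton$child) * (1 - r^2)   #Vy * (1 - r^2) gives the same value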

4.3 A geometric interpretation of covariance and correlation

Because X and Y are vectors in the space defined by the observations, the covariance between
them may be thought of in terms of the average squared distance between the two vectors
in that same space (see Equation 3.14). That is, following Pythagoras, the distance, d, is
simply the square root of the sum of the squared distances in each dimension (for each pair
of observations), or, if we find the average distance, we can find the square root of the sum
of the squared distances divided by n:

$$d_{xy} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$$

or

$$d_{xy}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2,$$

which is the same as

$$d_{xy}^2 = V_x + V_y - 2C_{xy},$$

but because

$$r_{xy} = \frac{C_{xy}}{\sigma_x\sigma_y},$$

$$d_{xy}^2 = \sigma_x^2 + \sigma_y^2 - 2\sigma_x\sigma_y r_{xy} \tag{4.10}$$
or, for standardized variables,

$$d_{xy} = \sqrt{2(1 - r_{xy})}. \tag{4.11}$$
Compare this to the trigonometric law of cosines,

$$c^2 = a^2 + b^2 - 2ab\cos(\theta_{ab}),$$

and we see that the squared distance between two vectors is the sum of their variances minus twice
the product of their standard deviations times the cosine of the angle between them. That
is, the correlation is the cosine of the angle between the two vectors. Figure 4.4 shows these
relationships for two Y vectors. The correlation, r1, of X with Y1 is the cosine of θ1, the ratio
of the projection of Y1 onto X to the length of Y1. From the Pythagorean Theorem, the length
of the residual of Y with X removed (Y.x) is σy√(1 − r²).

[Figure 4.4 appears here: “Correlations as cosines” — unit vectors y1 and y2 drawn at angles θ1 and θ2 from X, with projections r1 and −r2 on the horizontal axis and residuals of length √(1 − r1²) and √(1 − r2²) shown as dotted lines.]

Fig. 4.4 Correlations may be expressed as the cosines of angles between two vectors or, alternatively,
the length of the projection of a vector of length one upon another. Here the correlation between X and
Y1 = r1 = cos(θ1) and the correlation between X and Y2 = r2 = cos(θ2). That Y2 has a negative correlation
with X means that a unit change in X leads to a negative change in Y2. The vertical dotted lines represent
the amount of residual in Y, the horizontal dashed lines represent the amount that a unit change in X
results in a change in Y.

Linear regression is a way of decomposing Y into that which is predictable by X and that
which is not predictable (the residual). The variance of Y is merely the sum of the variances
of bX and residual Y. If the standard deviations of X, Y, and the residual Y are thought of
as the lengths of their respective vectors, then the sine of the angle between X and Y is √(Vr/Vy)
and the vector of length √(1 − Vr) is the projection of Y onto X. (Refer to Table 4.3.)
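That the correlation of two centered vectors is the cosine of the angle between them is easy to verify numerically; the following lines are my own sketch:

> set.seed(1)
> x <- scale(rnorm(100), scale=FALSE)        #a zero centered vector
> y <- scale(x + rnorm(100), scale=FALSE)    #a second centered vector related to the first
> sum(x*y)/sqrt(sum(x^2)*sum(y^2))           #the cosine of the angle between x and y
> cor(x, y)                                  #identical to the Pearson correlation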

4.4 The bivariate normal distribution

If x1 and x2 are both continuous normally distributed variables with mean 0 and standard
deviation 1 and a correlation of r, then the bivariate normal distribution is

$$f(x_1,x_2) = \frac{1}{2\pi\sqrt{1-r^2}}\, e^{-\frac{x_1^2 - 2rx_1x_2 + x_2^2}{2(1-r^2)}}. \tag{4.12}$$

The mvtnorm and MASS packages provide functions to find the cumulative distribution func-
tion or probability density function, or to generate random samples from the bivariate normal
and multivariate normal and t distributions (e.g., rmvnorm and mvrnorm).
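For example (my own illustration, not from the text), rmvnorm will generate bivariate normal data with a specified correlation and dmvnorm evaluates the density of Equation 4.12:

> library(mvtnorm)
> Sigma <- matrix(c(1,.6,.6,1), 2, 2)          #a population correlation of .6
> set.seed(42)
> xy <- rmvnorm(10000, mean=c(0,0), sigma=Sigma)
> cor(xy)                                      #the sample correlation is close to .6
> dmvnorm(c(0,0), mean=c(0,0), sigma=Sigma)    #the bivariate normal density at the mean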

4.4.1 Confidence intervals of correlations

For a given correlation, rxy , estimated from a sample size of n observations and with the
assumption of bivariate normality, the t statistic with df = n − 2 degrees of freedom may be
used to test for deviations from 0 (Fisher, 1921):

$$t_{df} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \tag{4.13}$$

Although Pearson (1895) showed that for large samples the standard error of r was

$$\frac{1-r^2}{\sqrt{n(1+r^2)}},$$

Fisher (1921) used the geometric interpretation of a correlation and showed that by transforming
the observed correlation r into a z using the hyperbolic arctangent transformation

$$z = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right) \tag{4.14}$$

then, for population value ρ, z will have a mean of

$$\bar{z} = \frac{1}{2}\log\left(\frac{1+\rho}{1-\rho}\right)$$

with a standard error of

$$\sigma_z = 1/\sqrt{n-3}. \tag{4.15}$$

Confidence intervals for r are thus found by using the r to z transformation (Equation 4.14),
the standard error of z (Equation 4.15), and then back transforming to the r metric (fisherz
and fisherz2r). The cor.test function will find the t value and associated probability value
and the confidence intervals for a pair of variables. The rcorr function from Frank Harrell’s
Hmisc package will find Pearson or Spearman correlations for the columns of matrices and
handles missing values by pairwise deletion. Associated sample sizes and p-values are reported
for each correlation. The r.con function from the psych package will find the confidence
intervals for a specified correlation and sample size (Table 4.4).

Table 4.4 Because of the non-linearity of the r to z transformation, and particularly for large values
of the estimated correlation, the confidence interval of a correlation coefficient is not symmetric around
the estimated value. The p values in the following table are based upon the t-test for a
difference from 0 with a sample size of 30 and are found using the pt function. The t values are found
directly from equation 4.13 and the confidence intervals by the r.con function.

> n <- 30
> r <- seq(0,.9,.1)
> rc <- matrix(r.con(r,n),ncol=2)
> t <- r*sqrt(n-2)/sqrt(1-r^2)
> p <- (1-pt(t,n-2))/2
> r.rc <- data.frame(r=r,z=fisherz(r),lower=rc[,1],upper=rc[,2],t=t,p=p)
> round(r.rc,2)

r z lower upper t p
1 0.0 0.00 -0.36 0.36 0.00 0.25
2 0.1 0.10 -0.27 0.44 0.53 0.15
3 0.2 0.20 -0.17 0.52 1.08 0.07
4 0.3 0.31 -0.07 0.60 1.66 0.03
5 0.4 0.42 0.05 0.66 2.31 0.01
6 0.5 0.55 0.17 0.73 3.06 0.00
7 0.6 0.69 0.31 0.79 3.97 0.00
8 0.7 0.87 0.45 0.85 5.19 0.00
9 0.8 1.10 0.62 0.90 7.06 0.00
10 0.9 1.47 0.80 0.95 10.93 0.00
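The same logic applies to a correlation estimated from data; the following comparison of cor.test with a hand calculation using fisherz and fisherz2r (Equations 4.14 and 4.15) is my own example:

> library(psych)
> data(galton)
> cor.test(galton$parent, galton$child)$conf.int                     #the confidence interval from cor.test
> r <- cor(galton$parent, galton$child)
> fisherz2r(fisherz(r) + qnorm(c(.025,.975))/sqrt(nrow(galton)-3))   #essentially the same interval by hand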


4.4.2 Testing whether correlations differ from zero

Null Hypothesis Significance Tests, NHST , examine the likelihood of observing a particular
correlation given the null hypothesis of no correlation. This may be found by using Fisher’s
test (equation 4.13) and finding the probability of the resulting t statistic using the pt function
or, alternatively, directly by using the corr.test function. Simultaneous testing of sets of
correlations may be done by the rcorr function in the Hmisc package.
The problem of whether a matrix of correlations, R, differs from those that would be
expected if sampling from a population of all zero correlations was addressed by Bartlett
(1950, 1951) and Box (1949). Bartlett showed that a function of the natural logarithm of the
determinant of a correlation matrix, the sample size (N), and the number of variables (p) is
asymptotically distributed as χ 2 :
$$\chi^2 = -\ln|R|\,(N - 1 - (2p+5)/6) \tag{4.16}$$

with degrees of freedom, d f = p ∗ (p − 1)/2. The determinant of an identity matrix is one, and
as the correlations differ from zero, the determinant will tend towards zero. Thus Bartlett’s
test is a function of how much the determinant is less than one. This may be found by the
cortest.bartlett function.
Given that the standard error of a z transformed correlation is 1/√(n − 3), and that a
squared z score is χ², a very reasonable alternative is to consider whether the sum of the
squared correlations differs from zero. When multiplying this sum by n − 3, it is distributed
as χ² with p*(p-1)/2 degrees of freedom. This is a direct test of whether the correlations
differ from zero (Steiger, 1980c). This test is available as the cortest function.
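As an example of these tests (mine, not from the text), Bartlett's test can be applied to the correlations of the sat.act data set from the psych package:

> library(psych)
> R <- cor(sat.act, use="pairwise")
> cortest.bartlett(R, n=nrow(sat.act))   #chi square, df = p*(p-1)/2, and the p value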

4.4.3 Testing the difference between correlations

There are four different tests of correlations and the differences between correlations that
are typically done: 1) is a particular correlation different from zero, 2) does a set of
correlations differ from zero, 3) do two correlations (taken from different samples) differ from
each other, and 4) do two correlations taken from the same sample differ from each other.
The first question, does a correlation differ from zero, was addressed by Fisher (1921) and
answered using a t-test of the observed value versus 0 (Equation 4.13) and looking up the
probability of observing that size t or larger with degrees of freedom of n-2 with a call to pt
(see Table 4.4 for an example). The second is answered by applying a χ 2 test (equation 4.16)
using either the cortest.bartlett or cortest functions. The last two of these questions
are more complicated and have two different sub questions associated with them.
Olkin and Finn (1990, 1995) and Steiger (1980a) provide very complete discussion and
examples of tests of the differences between correlations as well as the confidence intervals for
correlations and their differences. Three of the Steiger (1980a) tests are implemented in the
r.test function in the psych package. Olkin and Finn (1995) emphasize confidence intervals
for differences between correlations and address the problem of what variable to choose when
adding to a multiple regression.

4.4.3.1 Testing independent correlations: r12 is different from r34

Testing whether two correlations are different involves a z-test and depends upon whether
the correlations are from different samples or from the same sample (the dependent or cor-
related case). In the first case, where the correlations are independent, the correlations are
transformed to zs and the test is just the ratio of the difference (in z units) to
the standard error of a difference. The standard error is merely the square root of the sum
of the squared standard errors of the two individual correlations:

$$z_{r_{12}-r_{34}} = \frac{z_{r_{12}} - z_{r_{34}}}{\sqrt{1/(n_1-3) + 1/(n_2-3)}} \tag{4.17}$$

which is the same as

$$z_{r_{12}-r_{34}} = \frac{\frac{1}{2}\log\left(\frac{1+r_{12}}{1-r_{12}}\right) - \frac{1}{2}\log\left(\frac{1+r_{34}}{1-r_{34}}\right)}{\sqrt{1/(n_1-3) + 1/(n_2-3)}}.$$

This seems more complicated than it really is and can be done using the paired.r or r.test
functions (Table 4.5).

Table 4.5 Testing two independent correlations using Equation 4.17 and r.test results in a z-test.
> r.test(r12=.25,r34=.5,n=100)
$test
[1] "test of difference between two independent correlations"
$z
[1] 2.046730
$p
[1] 0.04068458

4.4.3.2 Testing dependent correlations: r12 is different from r23

A more typical test would be to examine whether two variables differ in their correlation
with a third variable. Thus with the variables X1, X2, and X3 with correlations r12, r13 and r23
the t of the difference of r12 minus r13 is

$$t_{r_{12}-r_{13}} = (r_{12}-r_{13})\sqrt{\frac{(n-1)(1+r_{23})}{2\frac{n-1}{n-3}|R| + \left(\frac{r_{12}+r_{13}}{2}\right)^2(1-r_{23})^3}} \tag{4.18}$$

where |R| is the determinant of the 3 × 3 matrix of correlations and is

$$|R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23} \tag{4.19}$$

(Steiger, 1980a). Consider the case of Extraversion, Positive Affect, and Energetic Arousal
with PA and EA assessed at two time points with 203 participants (Table 4.6). The t for
the difference between the correlations of Extraversion and Positive Affect at time 1 (.6)
and Extraversion and Energetic Arousal at time 1 (.5) is found from equation 4.18 using
the paired.r function and is 1.96 with a p value < .05. Steiger (1980a) and Dunn and
Clark (1971) argue that Equation 4.18, Williams’ Test (Williams, 1959), is preferred to
an alternative test for dependent correlations, the Hotelling T , which although frequently
recommended, should not be used.

4.4.3.3 Testing dependent correlations: r12 is different from r34

Yet one more case is the test of equality of two correlations both taken from the same
sample but for different variables (Steiger, 1980a). An example of this would be whether the
correlations for Positive Affect and Energetic Arousal at times 1 and 2 are the same. For four
variables (X1 ...X4 ) with correlations r12 ...r34 , the z of the difference of r12 minus r34 is

$$z_{r_{12}-r_{34}} = \frac{(z_{12}-z_{34})\sqrt{n-3}}{\sqrt{2(1-r_{12,34})}} \tag{4.20}$$

Table 4.6 The difference between two dependent correlations, rext,PA1 and rext,EA1 is found using Equa-
tion 4.18, which is implemented in the paired.r and r.test functions. Because these two correlations
share a common element (Extraversion), the appropriate test is found in Equation 4.18.

Ext 1 PA 1 EA 1 PA 2 EA 2
Extraversion 1
Positive Affect 1 .6 1
Energetic Arousal 1 .5 .6 1
Positive Affect 2 .4 .8 .6 1
Energetic Arousal 2 .3 .6 .8 .5 1

> r.test(r12=.6,r13=.5,r23=.6,n=203)
Correlation tests
Call:r.test(n = 203, r12 = 0.6, r23 = 0.6, r13 = 0.5)
Test of difference between two correlated correlations
t value 2 with probability < 0.047

where

$$r_{12,34} = \tfrac{1}{2}\big([(r_{13}-r_{12}r_{23})(r_{24}-r_{23}r_{34})] + [(r_{14}-r_{13}r_{34})(r_{23}-r_{12}r_{13})]$$
$$\qquad + [(r_{13}-r_{14}r_{34})(r_{24}-r_{12}r_{14})] + [(r_{14}-r_{12}r_{24})(r_{23}-r_{24}r_{34})]\big)$$

reflects that the correlations themselves are correlated. Under the null hypothesis of equiv-
alence, we can also assume that the correlations r12 = r34 (Steiger, 1980a) and thus both of
these values can be replaced by their average

$$\bar{r}_{12} = \bar{r}_{34} = \frac{r_{12}+r_{34}}{2}.$$
Calling r.test with the relevant correlations strung out as a vector shows that indeed, the
two correlations (.6 and .5) do differ reliably with a probability of .04 (Table 4.7).

Table 4.7 Testing the difference between two correlations from the same sample but that do not
overlap in the variables included. Because the correlations do not involve the same elements, but do
involve the same subjects, the appropriate test is Equation 4.20.

> r.test(r12=.6,r34=.5,r13=.8,r14=.6,r23=.6,r24=.8,n=203)

Correlation tests
Call:r.test(n = 203, r12 = 0.6, r34 = 0.5, r23 = 0.6, r13 = 0.8, r14 = 0.6,
r24 = 0.8)
Test of difference between two dependent correlations
z value 2.05 with probability 0.04

4.4.3.4 Testing whether correlation matrices differ across groups

The previous tests were comparisons of single correlations. It is also possible to test whether
the observed p * p correlation matrix, R1 for one group of subjects differs from R2 in a
different group of subjects. This has been addressed by several tests, perhaps the easiest
to understand is by Steiger (1980b). Given the null hypothesis of no difference, the sum of
squared differences of the z transformed correlation should be distributed as χ 2 with p *
(p-1) degrees of freedom. A somewhat more complicated derivation by Jennrich (1970) also
leads to a χ 2 estimate:
$$\chi^2 = \frac{1}{2}\,tr(Z^2) - diag(Z)'\,S^{-1}\,diag(Z) \tag{4.21}$$

where R = (R1 + R2)/2, the elements of S are the squared elements of R, c = n1n2/(n1 + n2),
and Z = √c R⁻¹(R1 − R2). Both of these tests are available in the psych package:
cortest.normal finds the sum of squared differences of either the raw or z transformed
correlations, and cortest.jennrich finds χ² as estimated in equation 4.21. Monte Carlo
simulations of these and an additional test, cortest.mat, suggest that all three, and in par-
ticular the z-transformed and Jennrich tests, are very sensitive to differences between groups
(Revelle and Wilt, 2008).

4.5 Other estimates of association

The Pearson Product Moment Correlation Coefficient (PPMCC ) was developed for the case
of two continuous variables with interval properties, which, with the assumption of bivari-
ate normality, can lead to estimates of confidence intervals and statistical significance (see
cor.test). As pointed out by Charles Spearman (1904b), the Pearson correlation may be
most easily thought of as

$$r = \frac{\sum x_iy_i}{\sqrt{\sum x_i^2\sum y_i^2}} \tag{4.22}$$
Dividing the numerator and the two elements in the square root by either n or n-1, this is,
of course, equivalent to Equation 4.8 for the Pearson Product Moment Correlation Coefficient.
A calculating formula that is sometimes used when doing hand calculations (for those who
are stuck without a working copy of R) and that is useful when finding PPMCC for special
cases (see below) uses raw scores rather than deviation scores:

$$\frac{n\sum X_iY_i - \sum X_i\sum Y_i}{\sqrt{(n\sum X_i^2 - (\sum X_i)^2)(n\sum Y_i^2 - (\sum Y_i)^2)}}. \tag{4.23}$$

Generalizing this formula to Rx, the matrix of correlations between the columns of a
zero centered matrix x, let

$$I_{sd} = diag\left(\frac{1}{\sqrt{diag(x'x)}}\right),$$

that is, let Isd be a diagonal matrix whose elements are proportional to the reciprocals of the standard
deviations of the columns of x; then

$$R_x = I_{sd}\,x'x\,I_{sd} \tag{4.24}$$


is the matrix of correlations between the columns of x.
There are a number of alternative measures of association, some of which appear very
different but are merely the PPMCC for special cases, while there are other measures for
cases where the data are clearly neither continuous nor at the interval level of measurement.
Even more coefficients of association are used as estimates of effect sizes.

4.5.1 Pearson correlation equivalents

Using Spearman’s formula for the correlation (Equation 4.22) allows a simple categoriza-
tion of a variety of correlation coefficients that at first appear different but are functionally
equivalent (Table 4.8).

Table 4.8 A number of correlations are Pearson r in different forms, or with particular assumptions.
If $r = \frac{\sum x_iy_i}{\sqrt{\sum x_i^2 \sum y_i^2}}$, then depending upon the type of data being analyzed, a variety of correlations are
found.

Coefficient      symbol    X            Y            Assumptions
Pearson          r         continuous   continuous
Spearman         rho (ρ)   ranks        ranks
Point bi-serial  rpb       dichotomous  continuous
Phi              φ         dichotomous  dichotomous
Bi-serial        rbis      dichotomous  continuous   normality of latent X
Tetrachoric      rtet      dichotomous  dichotomous  bivariate normality of latent X, Y
Polychoric       rpc       categorical  categorical  bivariate normality of latent X, Y
Polyserial       rps       categorical  continuous   bivariate normality of latent X, Y

4.5.1.1 Spearman ρ: a Pearson correlation of ranks

In the first of two major papers published in the American Journal of Psychology in 1904,
Spearman (1904b) reviewed for psychologists the efforts made to define the correlation co-
efficient by Galton (1888) and Pearson (1895). Not only did he consider the application of
the Pearson correlation to ranked data, but he also developed corrections for attenuation
and the partial correlation, two subjects that will be addressed later. The advantage of using
ranked data rather than the raw data is that it is more robust to variations in the extreme
scores. Whether the highest scoring person on an exam has a score of 8,000 or of 6,000
makes no difference to the ranks. Consider Y as ten numbers sampled from 1 to 20
and then find the Pearson correlation with Y² and e^Y. Do the same things for the ranks
of these numbers. That is, find the Spearman correlations. As is clear from Figure 4.5, the
Spearman correlation is not affected by the large non-linear transformations applied to the
data (Spearman, 1907).
It should be observed, that in many cases the non-linear form is more apparent than real.
Generally speaking, a mere tendency of two characteristics to vary concurrently must be taken,

it seems to me, as the effect of some particular underlying strict law (or laws) partly neutralized
by a multitude of ’casual’ disturbing influences. The quantity of a correlation is neither more
nor less than the relative influence of the underlying law in question as compared with the total
of all the influences in play. Now, it may easily happen, that the underlying law is one of simple
proportionality but the disturbing influences become greater when the correlated characteristics
are larger (or smaller, as the case may be). Then the underlying simple proportionality will not
appear on the surface; the correlation will seem non-linear. Under such circumstances, r cannot,
it is true, express these variations in the quantity of correlation; it continues, however, to express
completely the mean quantity of correlation.
In the majority of the remaining cases of non-linearity, the latter is merely due to a wrong choice
of the correlated terms. For instance, the correlation between the length of the skull and the
weight of the brain must, obviously, be very far from linear. But linearity is at once restored
(supposing all the skulls to belong to one type) if we change the second term from the brain’s
weight to the cube root of the weight.
To conclude, even when the underlying law itself really has a special non-linear form, although
r by itself reveals nothing of this form, it nevertheless still gives (except in a few extreme and
readily noticeable cases) a fairly approximate measure of the correlation’s quantity. Spearman,
1907 p 168-169

Although a somewhat different formula is reported in Spearman (1904b), the calculating
formula is

$$\rho = 1 - \frac{6\sum d^2}{n(n^2-1)} \tag{4.25}$$
where d is the difference in ranks. Holmes (2001) presents a very understandable graphical
proof of this derivation. Alternatively, just find the Pearson correlation of the ranked data.
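A small demonstration (my own) that the Spearman correlation is simply a Pearson correlation of the ranks, and so is unchanged by a monotonic transformation of the data:

> set.seed(7)
> y <- sample(1:20, 10)                  #ten numbers sampled from 1 to 20
> cor(y, exp(y))                         #the Pearson r is reduced by the non-linearity
> cor(y, exp(y), method="spearman")      #the Spearman rho is exactly 1
> cor(rank(y), rank(exp(y)))             #the same value, found as a Pearson r of the ranks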

4.5.1.2 Point biserial: A Pearson correlation of a continuous variable with a dichotomous variable

If one of two variables, X, is dichotomous, and the other, Y, is continuous, it is still possible
to find a Pearson r, but this can also be done by using a short cut formula. An example of
this problem would be to ask the correlation between gender and height. Done this way, the
correlation is known as the point biserial correlation but it is in fact, just a Pearson r.
If we code one of the two genders as 0 and the other as 1, then Equation 4.23 becomes

$$r_{pb} = \frac{npq(\bar{Y}_2 - \bar{Y}_1)}{\sqrt{npq(n-1)\sigma_y^2}} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sigma_y}\sqrt{\frac{npq}{n-1}} \tag{4.26}$$

where n is the sample size, p and q are the probabilities of being in group 1 or group 2, Y¯1
and Ȳ2 are the means of the two groups, and σy² is the variance of the continuous variable.
That is, the point biserial correlation is a direct function of the difference between the means
and the relative frequencies of two groups. For a fixed sample size and difference between the
group means, it will be maximized when the two groups are of equal size.
Thinking about correlations as reflecting the differences of means compared to the standard
deviation of the dependent variable suggests a comparison to the t-test. And in fact, the point
biserial is related to the t-test, for with d f = n − 2,

Table 4.9 The point biserial correlation is a Pearson r between a continuous variable (height) with a
dichotomous variable (gender). It is equivalent to a t-test with pooled error variance.

> set.seed(42)
> n <- 12 #sample size
> gender <- sample(2,n,TRUE) #create a random vector with two values
> height <- sample(10,n,TRUE)+ 58 + gender*3 #create a second vector with up to 10 values
> g.h <- data.frame(gender,height) #put into a data frame
> g.h[order(g.h[,1]),] #show the data frame
> cor(g.h) #the Pearson correlation between height and gender
> t.test(height~gender,data=g.h,var.equal=TRUE)
> r <- cor(g.h)[1,2] #get the value of the correlation
> r * sqrt((n-2)/(1-r^2)) #find the t- equivalent of the correlation, compare to the t-test.

gender height
2 1 63
3 1 67
4 1 67
6 1 63
9 1 66
11 1 66
12 1 67
1 2 66
5 2 70
7 2 71
8 2 66
10 2 67

gender height
gender 1.00000 0.54035
height 0.54035 1.00000

Two Sample t-test

data: height by gender


t = -2.0307, df = 10, p-value = 0.06972
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.0932281 0.2360852
sample estimates:
mean in group 1 mean in group 2
65.57143 68.00000

# t calculated from point biserial


[1] 2.030729

[Figure 4.5 appears here: a SPLOM of y, y2, ye and their ranked versions yr, y2r, and yer; the correlations among the ranked variables are all 1.00, while those among the raw variables are lower (e.g., y with ye = .59).]

Fig. 4.5 Spearman correlations are merely Pearson correlations applied to ranked data. Here y is
randomly sampled from the interval 1-20. y2 is y², ye is e^y; yr, y2r and yer are y, y2 and ye expressed
as ranks. The correlations are found as Pearson r, but those between the ranked variables are
equivalent to Spearman ρ.

$$t_{df} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{\frac{\sigma_{y.x}^2}{n_1} + \frac{\sigma_{y.x}^2}{n_2}}} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{\frac{(n_1+n_2)\sigma_{y.x}^2}{n_1n_2}}} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sigma_{y.x}\sqrt{\frac{n}{n_1n_2}}} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sigma_{y.x}}\sqrt{npq} \tag{4.27}$$

Comparing Equations 4.27 and 4.26 and recognizing that the within cell variance in the t-test
is the residual variance in y after x is removed, $\sigma_{y.x}^2 = \sigma_y^2(1-r^2)$,

$$t = r_{pb}\sqrt{\frac{n-2}{1-r_{pb}^2}} = r_{pb}\sqrt{\frac{df}{1-r_{pb}^2}} \tag{4.28}$$

Although the t and point biserial correlation are transforms of each other, it is incorrect to
artificially dichotomize a continuous variable to express the relationship as a t value. If X and
Y are both continuous, the appropriate measure of relationship is the Pearson correlation.
By artificially dichotomizing one variable in order to express the effect as a t rather than a

r, the strength of the relationship is reduced. Compare the four panels of Figure 4.6. The
underlying scatter plot is shown for four values of a Pearson r (.9, .6, .3, and .0). Forming
groups by setting values of Y < 0 to 0 and values greater than or equal to 0 to 1, results
in the frequency distributions shown at the bottom of each panel. The corresponding point
biserial correlations are reduced by 20%. That is, the point biserial for equal sized groups is
.8 of the original Pearson r. In terms of power to detect a relationship, this is the equivalent
of throwing away 36% of the observations.
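The cost of artificial dichotomization is easy to see in a small simulation (my own sketch): after a median split of X, the point biserial is roughly .8 of the underlying Pearson correlation.

> set.seed(123)
> x <- rnorm(10000)
> y <- .6*x + rnorm(10000, sd=sqrt(1-.36))   #a population correlation of .6
> cor(x, y)                                  #about .6
> cor((x > 0) + 0, y)                        #the point biserial after a median split, about .8 * .6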

[Figure 4.6 appears here: four scatter plots of y against x with panel titles “r = 0.9, rpb = 0.71, rbis = 0.89”, “r = 0.6, rpb = 0.48, rbis = 0.6”, “r = 0.3, rpb = 0.23, rbis = 0.28”, and “r = 0, rpb = 0.02, rbis = 0.02”.]

Fig. 4.6 The point biserial correlation is a Pearson r between a continuous variable and a dichotomous
variable. If both X and Y are continuous and X is artificially dichotomized, the point biserial will be
less than the original Pearson correlation. The biserial correlation is based upon the assumption of
underlying normality for the dichotomized variable. It more closely approximates the “real” correlation.
The estimated density curves for y are drawn for the groups formed from the dichotomized values of
x.

4.5.1.3 Phi: A Pearson correlation of dichotomous data

In the case where both X and Y are naturally dichotomous, another short cut for the Pearson
correlation is the phi (φ ) coefficient. A typical example might be the success of predicting
applicants to a graduate school. Two actions are taken, accept or reject, and two outcomes
are observed, success or failure. This leads to the two by two table (Table 4.10). In terms of

Table 4.10 The basic table for a phi, φ coefficient, expressed in raw frequencies in a four fold table
is taken from Pearson and Heron (1913)

          Success       Failure       Total
Accept    A             B             R1 = A + B
Reject    C             D             R2 = C + D
Total     C1 = A + C    C2 = B + D    n = A + B + C + D

the raw data coded 0 or 1, the phi coefficient can be derived directly from Equation 4.23 by
direct substitution, recognizing that the only non-zero product is found in the A cell:

$$n\sum X_iY_i - \sum X_i\sum Y_i = nA - R_1C_1$$

$$\phi = \frac{AD - BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}. \tag{4.29}$$
Table 4.10 may be converted from frequency counts to proportions by dividing all entries by
the total number of observations (A+B+C+D) to produce a more useful table (Table 4.11).
In this table, the cell counts (A, B, C, D) are expressed as their proportions (a, b, c, d)
and the fraction of applicants accepted, R1/n = (A+B)/(A+B+C+D), may be called the Selection Ratio; the
fraction rejected is thus 1 − SR. Similarly, the fraction of students who would have succeeded
if accepted, C1/n = (A+C)/(A+B+C+D), may be called the Hit Rate, and the proportion who would fail
is 1 − HR. If being accepted or succeeding is given a score of 1, and rejected or failing, a score

Table 4.11 The basic table for a phi coefficient expressed in proportions
          Success          Failure          Total
Accept    Valid Positive   False Positive   R
Reject    False Negative   Valid Negative   1 − R
Total     C                1 − C            1

of 0, then the PPMCC of Table 4.11 may be found from Equation 4.23 as

$$\phi = \frac{\sum x_iy_i}{\sqrt{\sum x_i^2\sum y_i^2}} = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{\sqrt{(n\sum X_i^2 - (\sum X_i)^2)(n\sum Y_i^2 - (\sum Y_i)^2)}} = \frac{VP - RC}{\sqrt{R(1-R)C(1-C)}}. \tag{4.30}$$

The numerator is the number of valid positives (cell a, or the percentage of valid positives) less
the number expected if there were no relationship (R ∗ C). The denominator is the square
root of the product of the row and column variances. As can be seen from equation 4.30, for
fixed marginals the correlation is linear with the percentage of valid positives.
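For a concrete (hypothetical) four fold table, the phi function in the psych package and a direct Pearson correlation of the 0/1 codes give the same answer; the frequencies below are made up for illustration:

> library(psych)
> freq <- matrix(c(40,10,20,30), 2, 2, dimnames=list(c("Accept","Reject"), c("Success","Failure")))
> phi(freq)                                        #phi computed from the four fold table
> accept  <- rep(c(1,0,1,0), times=c(40,10,20,30)) #1 = accepted, cell by cell
> success <- rep(c(1,1,0,0), times=c(40,10,20,30)) #1 = succeeded
> cor(accept, success)                             #the same value found as a Pearson r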
The φ coefficient is a PPMCC between two dichotomous variables. It is not, however, the
equivalent of the PPMCC of continuous data that have been artificially dichotomized.
In addition, where the cuts are made greatly affects the correlation. Consider the case of two
normally distributed variables (X and Y) that are correlated .6 in the population. If these
variables are dichotomized at -1, 0, or 1 standard deviations from the mean, the correlations
between them are attenuated, most so for the case of one variable being cut at the lower value and
the other being cut at the higher value. More importantly, the correlation of two dichotomized
variables formed from the same underlying continuous variable is also seriously attenuated
(Figure 4.7). That is, the correlations between the four measures of X (X, Xlow, Xmid, and
Xhigh), although based upon exactly the same underlying numbers, range from .19 to .43.
Indeed, the maximum value of phi, or phi max (φmax), is a function of the marginal distributions
and is

$$\phi_{max} = \sqrt{\frac{p_xq_y}{p_yq_x}} \tag{4.31}$$

where px + qx = py + qy = 1 and px, py represent the proportion of subjects “passing” an item.
The point bi-serial correlation is also affected by the distribution of the dichotomous
variable. The first two rows in Figure 4.7 show how a continuous variable correlates between
.65 and .79 with a dichotomous variable based upon that continuous variable.

4.5.1.4 Tetrachoric and polychoric correlations

If a two by two table is thought to represent an artificial dichotomization of two continuous


variables with a bivariate normal distribution, then it is possible to estimate that correla-
tion using the tetrachoric correlation (Pearson, 1900; Carroll, 1961). A generalization of the
tetrachoric to more than two levels is the polychoric correlation. The tetrachoric function
may be used to find the tetrachoric correlation as can the polychor function in the polycor
package which also will find polychoric correlations.
Perhaps the major application of the tetrachoric correlation is when doing item analysis
when each item is assumed to represent an underlying ability which is reflected as a proba-
bility of responding correctly to the item and the items are coded as correct or incorrect. In
this case (discussed in more detail when considering Item Response Theory in Chapter 8),
the difficulty of an item may be expressed as a function of the item threshold, τ, or the
cumulative normal equivalent of the percent passing the item. The tetrachoric correlation
is then estimated by comparing the number in each of the four cells with that expected
from a bivariate normal distribution cut at τx and τy (see Figure 4.8 which was drawn using
draw.tetra).
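A brief sketch (my own example, assuming a latent correlation of .6 as in Figure 4.7) of how the tetrachoric correlation recovers the latent value that phi underestimates:

> library(psych)
> library(mvtnorm)
> set.seed(17)
> latent <- rmvnorm(1000, sigma=matrix(c(1,.6,.6,1), 2, 2))  #latent bivariate normal, r = .6
> x <- (latent[,1] > 1) + 0      #dichotomize the first variable at +1 sd
> y <- (latent[,2] > 0) + 0      #dichotomize the second variable at the mean
> cor(x, y)                      #the phi coefficient is badly attenuated
> tetrachoric(cbind(x, y))       #the tetrachoric estimate is much closer to .6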
Unfortunately, for extreme differences in marginals, estimates of the tetrachoric do not
provide a very precise estimate of the underlying correlation. Consider the data and φ corre-
lations from Figure 4.7. Although the polychoric correlation does a very good job of estimating
the correct value of the underlying correlation between X and Y (.60) for different values of
dichotomization, and correctly finds a very high value for the correlation between the various
sets of Xs and Ys (1.0), in some cases, if there are zero entries in one cell, the estimate is
seriously wrong. One solution to this problem is to apply a correction for continuity which
notes that a cell count of 0 represents somewhere between 0 and .5 cases. tetrachoric automati-
cally applies this correction but warns when this happens. In Table 4.12, this correction was
[Figure 4.7 appears here: a SPLOM of X, Y, Xlow, Ylow, Xmid, Ymid, Xhigh, and Yhigh. The first two rows show the correlation of the continuous X and Y (.61) and their point bi-serial correlations with the dichotomized versions; the remaining cells show the φ correlations among the variables dichotomized at -1, 0, and 1 standard deviations from the mean.]

Fig. 4.7 The φ coefficient is a Pearson correlation applied to dichotomous data. It should not be used
as a short cut for estimating the relationship between two continuous variables. φ is sensitive to the
marginal frequencies of the two variables. Shown is the SPLOM of two continuous variables, X and Y,
and dichotomous variables formed by cutting X and Y at -1, 0, or 1 standard deviations from the mean.
Note how the φ coefficients underestimate the underlying correlation, particularly if the marginals
differ. The first two rows of correlations are point-biserial correlations between the continuous X and
Y and the dichotomized scores.

Redoing the analysis with the correction applied will yield somewhat different results.
Günther and Höfler (2006) give an example from a comorbidity study where applying or not
applying the correction makes a very large difference. Examples of the effect of the continuity
correction are in the help for tetrachoric.
This problem of differences in endorsement frequency (differences in marginals) will be addressed
again when considering factor analysis of items (6.6), where the results will be much clearer when
tetrachoric correlations are used.

Table 4.12 The effect of various cut points upon polychoric and phi correlations. The original data
matrix is created for X and Y using the rmvnorm function with a specified mean and covariance structure.
Three dichotomous versions of X and Y are formed by cutting the data at -1, 0, or 1 standard deviations
from the mean. The tetrachoric correlations are found using the tetrachoric function. These are shown
below the diagonal by using lower.tri. Similarly, by using upper.tri the entries above the diagonal
are phi correlations which are, of course, just the standard Pearson correlation applied to dichotomous
data. The empirical thresholds, τ, are close to the -1, 0, and 1 cutpoints. The data are also tabled to
show the effect of extreme cuts. Compare these correlations with those shown in Figure 4.7.

> library(psych)
> library(mvtnorm) #needed for rmvnorm
> set.seed(17) #to reproduce the results
> Sigma <- matrix(c(1,.6,.6,1),2,2) #population correlation of .6 (assumed; not shown in the original listing)
> cut <- c(-1,-1,0,0,1,1)
> D <- rmvnorm(n=1000, sigma=Sigma) #create the data
> D3 <- cbind(D,D,D)
> d <- D3
> D3[t(t(d) > cut)] <- 1
> D3[t(t(d) <= cut)] <- 0
> xy <- D3[,c(1,3,5,2,4,6)]
> colnames(xy) <- c("Xlow","Xmid","Xhigh","Ylow","Ymid","Yhigh")
> #describe(xy)
> tet.mat <- tetrachoric(xy,correct=FALSE) #don't correct for continuity
> phi.mat <- cor(xy)
> both.mat <- tet.mat$rho * lower.tri(tet.mat$rho,TRUE) + phi.mat * upper.tri(phi.mat) #combine them
> round(both.mat,2)
> round(tet.mat$tau,2) #thresholds

Xlow Xmid Xhigh Ylow Ymid Yhigh


Xlow 1.00 0.44 0.19 0.35 0.30 0.18
Xmid 0.99 1.00 0.43 0.34 0.41 0.36
Xhigh 0.97 0.99 1.00 0.19 0.32 0.40
Ylow 0.60 0.64 0.66 1.00 0.44 0.20
Ymid 0.57 0.60 0.61 0.99 1.00 0.44
Yhigh 0.60 0.70 0.65 0.97 0.99 1.00

Xlow Xmid Xhigh Ylow Ymid Yhigh


-0.99 -0.02 0.99 -0.97 0.00 0.97
> table(xy[,2],xy[,5]) #both middle range

0 1
0 350 144
1 150 356
> table(xy[,3],xy[,4]) #x high, y low

0 1
0 164 676
1 1 159
> table(xy[,1],xy[,3]) #both low range

0 1
0 162 0
1 678 160
> table(xy[,1],xy[,6]) #x low, y high

0 1
0 160 2
1 675 163
[Figure 4.8 appears here: a bivariate normal distribution with ρ = 0.8, cut at thresholds τ on x and y, dividing the plane into the four cells X < τ or X > τ crossed with Y < τ or Y > τ.]
Fig. 4.8 The tetrachoric correlation is estimated from the marginal distributions of x and y as well as
from their joint frequency. The maximum likelihood estimate assumes bivariate normality.

4.5.1.5 Biserial and polyserial correlations: An estimated Pearson correlation of a continuous variable with an ordered categorical variable

While the point biserial correlation and the φ coefficient are equivalent to a Pearson r, the
biserial correlation and polyserial correlation are not. The point biserial is just a short cut
formula (Equation 4.26) for the Pearson r where one of the two variables, Y, is continuous
and the other, X, is dichotomous. If, however, the dichotomous variable is assumed to be a
dichotomy of a normally distributed variable divided at a particular cut point into two levels
(0 and 1) with probabilities of q and p, then the biserial correlation (rbis ) is

$$r_{bis} = \frac{\bar{Y}_2 - \bar{Y}_1}{\sigma_y}\frac{pq}{z_p} \qquad (4.32)$$

where z_p is the ordinate of the normal curve at the item threshold, τ, and τ is the cumulative
normal equivalent of the probability, p. Thus,
$$r_{bis} = r_{pb}\frac{\sqrt{pq}}{z_p}. \qquad (4.33)$$

The use of the biserial correlation is highly discouraged by some (e.g., Nunnally, 1967) and
recommended only with extreme caution by others (Nunnally and Bernstein, 1984), but it is
probably appropriate when modeling the underlying latent relationships between dichotomous
items and continuous measures, provided the sample size is not too small. Biserial correlations
may be found by using the biserial function. When applied to the data of Table 4.9, the biserial
correlation is .70 compared to the observed Pearson (and thus point-biserial) correlation of .54.
Examining Equation 4.33, it is clear that r_bis > r_pb and that even in the best case (a 50-50
split), the point-biserial will be just 80% of the underlying correlation.
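A hedged sketch of that comparison (the simulated values below are arbitrary; the biserial function named above is assumed to be available from psych):

library(psych)
set.seed(17)
z <- rnorm(1000)                                #latent continuous variable
x <- .7 * z + rnorm(1000, sd = sqrt(1 - .49))   #observed continuous measure, r of about .7 with z
y <- as.numeric(z > 0)                          #the latent variable observed only as a dichotomy
cor(x, y)         #the point-biserial, roughly .8 * .7 = .56 with this 50-50 split
biserial(x, y)    #the biserial estimate, which should be close to .7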
The biserial correlation is a special case of the polyserial correlation, r_ps, which estimates
the Pearson correlation between two continuous variables when one is observed directly and
the other is observed only as an ordered categorical variable (say the four to six levels of a
personality or mood item). For a continuous X and an ordered categorical Y, the simple “ad
hoc” estimator of r_ps (Olsson et al., 1982) is a function of the observed point-polyserial
correlation (which is just the Pearson r), the standard deviation of y, and the normal ordinates
of the cumulative normal values of the probabilities of the alternatives:
$$r_{ps} = \frac{r_{xy}\sigma_y}{\sum z_{p_i}}. \qquad (4.34)$$

This “ad hoc” estimator is simple to find and is a close approximation to that found by
maximum likelihood. Just as with the biserial and point-biserial correlations, the polyserial
correlation will be greater than the equivalent point-polyserial.

4.5.1.6 Correlation and comorbidity

In medicine and clinical psychology, diagnoses tend to be categorical (someone is depressed
or not, someone has an anxiety disorder or not). Co-occurrence of both of these symptoms
is called comorbidity. Diagnostic categories vary in their degree of comorbidity with other
diagnostic categories. From the point of view of correlation, comorbidity is just a name applied
to one cell in a four fold table. It is thus possible to analyze comorbidity rates by considering
the probability of the separate diagnoses and the probability of the joint diagnosis. This gives
the two by two table needed for a φ or rtet correlation. Table 4.13 gives an example using the
comorbidity function.

4.6 Other measures of association

Although most of psychometrics is concerned with combining and partitioning variances and
covariances and the resulting correlations in the manner developed by Galton (1888), Pearson
(1895) and Spearman (1904b), it is useful to consider other measures of association that are
used in various applied settings. The first set of these is concerned with naturally occurring
dichotomies, while a second set has to do with measuring the association between categorical
variables (e.g., diagnostic categories). A third set measures associations within classes of
equivalent measures and uses an analysis of variance approach to find the appropriate coefficients.

Table 4.13 Given the base rates (proportions) of two diagnostic categories (e.g., .2 and .15) and
their co-occurrence (comorbidity, e.g., .1), it is straightforward to find the correlation between the two
diagnoses. The tetrachoric coefficient is the most appropriate for subsequent analysis.

> comorbidity(.2,.15,.1,c("Anxiety","Depression"))

Call: comorbidity(d1 = 0.2, d2 = 0.15, com = 0.1, labels = c("Anxiety",


"Depression"))
Comorbidity table
Anxiety -Anxiety
Depression 0.1 0.05
-Depression 0.1 0.75

implies phi = 0.49 with Yule = 0.87 and tetrachoric correlation of 0.75


4.6.1 Naturally dichotomous data

There are many variables, particularly those that reflect actions, that are dichotomous (giving
a vaccine, admitting someone to graduate school, diagnosing a disease). Similarly, many
outcomes of these actions are also dichotomous (surviving vs. dying, finishing the Ph.D.
or not, having a disease or not). Although Pearson argued that the latent relationship was
best described as bivariate normal, and thus that the appropriate statistic would be the tetrachoric
correlation, Yule (1912) and others have examined measures of relationship that do not
assume normality. Table 4.14, adapted from Yule (1912), provides the four cell entries that
enter into multiple estimates of association. Pearson and Heron (1913) responded to Yule
(1912) and showed that even with extreme non-normality, the phi and tetrachoric correlations
were superior to others that had been proposed.

Table 4.14 Two dichotomous variables produce four outcomes. Yule (1912) used the example of vac-
cinated and not vaccinated crossed with survived or dead. Similarly, colleges accept or reject applicants
who either do or do not graduate.

                    Action   non-action    total
Positive Outcome      a          b          a+b
Negative Outcome      c          d          c+d
total                a+c        b+d       a+b+c+d

Given such a table, there are a number of measures of association that have been or are
being used.

4.6.1.1 Odds ratios, risk ratios, and the problem of base rates

From a patient’s point of view, it would seem informative to know the ratio of how many
survived versus how many died given a vaccine (a/c). But this ratio is only meaningful in
contrast to the corresponding ratio for those who were not vaccinated (b/d). The Odds Ratio
compares these two odds:
$$OR = \frac{a/c}{b/d} = \frac{ad}{bc} \qquad (4.35)$$
Unfortunately, if the number of cases is small and either the b or c cell is empty, the OR
is infinite. A standard solution in this case is to add .5 to all cells. For cell sizes of n_a, ..., n_d,
the standard error of the logarithm of the odds ratio is (Fleiss, 1993)
$$SE(\ln(OR)) = \sqrt{\frac{1}{n_a} + \frac{1}{n_b} + \frac{1}{n_c} + \frac{1}{n_d}}.$$
An alternative to the Odds Ratio is the Risk Ratio. For instance, what fraction of patients
given a medication survive, a/(a + c)? Or, if a test is given for diagnostic reasons, what
percentage of patients with the disease (a + c) test positive for the disease (a)? This is known
as the sensitivity of the test. It is also important to know what percentage of patients without
the disease test negative (the specificity of the test). More informative is the Relative Risk
Ratio, which compares the risk given the action to the risk without the action: that is, the ratio
of the proportion of patients who survive given treatment to the proportion who survive without
treatment,
$$RRR = \frac{a/(a+c)}{b/(b+d)} = \frac{sensitivity}{1 - specificity} = \frac{a(b+d)}{b(a+c)}. \qquad (4.36)$$
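A minimal base-R sketch of these quantities, using hypothetical cell counts laid out as in Table 4.14 (the a, b, c, d values are made up for illustration):

a <- 90; b <- 60; c <- 10; d <- 40        #naming a variable c does not break the c() function below
OR <- (a * d) / (b * c)                   #odds ratio, Equation 4.35
se.lnOR <- sqrt(1/a + 1/b + 1/c + 1/d)    #standard error of ln(OR) (Fleiss, 1993)
exp(log(OR) + c(-1.96, 1.96) * se.lnOR)   #approximate 95% confidence interval for the OR
sensitivity <- a / (a + c)
specificity <- d / (b + d)
RRR <- sensitivity / (1 - specificity)    #relative risk ratio, Equation 4.36
c(OR = OR, RRR = RRR)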

Odds ratios and relative risk ratios, along with their confidence intervals, may be found using the
epidemiological packages Epi and epitools. Just as the regressions of Y on X and of X on Y
yield different slopes, so the odds ratio depends upon the direction of prediction. That
is, the odds of a positive outcome given an action (a/c) are not the same as the odds of
an action having happened given a positive outcome (a/b). This difference depending upon
direction can lead to serious confusion, for many phenomena that seem highly related in
one direction have only small odds ratios in the opposite direction. It is important to
realize, and somewhat surprising, that the frequency of observing an event associated with a
particular condition (having lung cancer and having been a smoker, having an auto accident
and having been drinking, being pregnant and having had recent sexual intercourse) is much
higher than the frequency in the reverse direction (the percentage of smokers who have lung
cancer, the fraction of drivers who have been drinking who have accidents, the percentage
of women who have recently had sexual intercourse who are pregnant). Consider the case
of the relationship between sexual intercourse and pregnancy (Table 4.15). In this artificial
example, the probability of becoming pregnant given intercourse (a/(a+c)) is .0019, while the
probability of having had intercourse given that one is pregnant is 1.0. The odds ratio ad/bc is
undefined, although adding .5 to all cells to correct for zero values yields an odds ratio of 12.51.
The relative risk is also undefined unless .5 is added, in which case it becomes 12.49. The φ
coefficient is .04 while the tetrachoric correlation (found using the tetrachoric function) is
.938. But the latter makes the assumption of bivariate normality with extreme cut points, and it
is not at all clear that this is appropriate for the data of Table 4.15.

4.6.1.2 Yule’s Q and Y

Yule (1900, 1912) developed two measures of association, one of which, Q (for Quetelet),
may be seen as a transform of the Odds Ratio onto a metric ranging from -1 to 1. Yule’s Q
statistic is
$$Q = \frac{ad - bc}{ad + bc} = \frac{ad/bc - 1}{ad/bc + 1} = \frac{OR - 1}{OR + 1}. \qquad (4.37)$$
A related measure of association, the coefficient of colligation, called ω by Yule (1912) but
also known as Yule’s Y, is
$$\omega = Y = \frac{\sqrt{OR} - 1}{\sqrt{OR} + 1}.$$
Yule’s coefficient has the advantage over the odds ratio that it is defined even if one of the off-diagonal
elements (b or c) is zero, in which case Q = 1. However, if b or c is 0, then no matter
what the other cells are, Q will still be one. As a consequence, Yule’s Q was not uniformly
appreciated, and it was strongly attacked by Pearson’s student Heron (1911) as not being
consistent with φ or the tetrachoric correlation (see MacKenzie, 1978). In a very long article,
Pearson and Heron (1913) give many examples of the problems with Yule’s coefficient and of
why φ gives more meaningful results. In their paper, Pearson and Heron also consider the problem
with φ, which is that it is limited by differences in the marginal frequencies. Both the Q and Y
statistics are found in the Yule function.
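For example, both statistics can be computed directly from their definitions (a base-R sketch with hypothetical cell counts; the Yule function mentioned above works from a two by two table of frequencies):

yule.QY <- function(a, b, c, d) {
  OR <- (a * d) / (b * c)                      #the odds ratio
  c(Q = (OR - 1) / (OR + 1),                   #Yule's Q, Equation 4.37
    Y = (sqrt(OR) - 1) / (sqrt(OR) + 1))       #Yule's Y, the coefficient of colligation
}
yule.QY(90, 60, 10, 40)                        #hypothetical counts for illustration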

4.6.1.3 Even more measures of association for dichotomous data

In psychometrics, the preferred measures of association for dichotomous data are either the
tetrachoric correlation (based upon normal theory) or the Pearson correlation, which when
applied to dichotomous data is the phi coefficient. In other fields, however, a number of measures
of similarity have been developed. Jackson et al. (1989) distinguish between measures
of co-occurrence and measures of association. The phi, tetrachoric, and Yule coefficients are
measures of association, while there are at least eight measures of similarity. The oldest, Jaccard’s
coefficient of community, was developed to measure the frequency with which two species co-occurred
across various regions of the Jura mountains and is just the number of co-occurrences
(a) divided by the number of occurrences of one or the other or both (a + b + c).
When these measures are rescaled so that their maximum for perfect association is 1,
and their value when there is no association is zero, most of these indices are equivalent to
Loevinger’s H statistic (Loevinger, 1948; Warrens, 2008).
These different measures of similarity are typically used in fields where the clustering of
objects (not variables) is important. That most of them are just transforms of Loevinger’s H
suggests that there is less need to consider each one separately (Warrens, 2008).

4.6.1.4 Base rates and inference

In addition to the problem of the direction of inference is the problem of base rates. Even for
tests with high sensitivities (a/(a + c)) and specificities (d/(b + d)), if the base rates are extreme,
misdiagnosis is common. Consider the well known example of a very sensitive and specific
test for HIV/AIDS. Let these two values be .99. That is, out of 100 people with HIV/AIDS,

Table 4.15 It is important to realize, and somewhat surprising, that the frequency of observing an
event associated with a particular condition (e.g., lung cancer and smoking, auto accidents and drinking,
pregnancy and sexual intercourse) is very different from the inverse (e.g., smoking and lung cancer,
drinking and auto accidents, sexual intercourse and pregnancy). In this hypothetical example, a couple
is assumed to have had sexual intercourse twice a week for ten years and to have had two children.
From the two by two table, it is possible to calculate the tetrachoric correlation using the polychor or
tetrachoric functions, Yule's Q using Yule, the φ correlation using phi, and Cohen's kappa using the
cohen.kappa function.

               Intercourse   No intercourse   total
Pregnant             2              0             2
Not pregnant      1038           2598          3638
total             1040           2600          3640

> pregnant = matrix(c(2,0,1038,2598),ncol=2)


> colnames(pregnant) <- c("sex","nosex")
> rownames(pregnant) <- c("yes","no")
> pregnant

sex nosex
yes 2 1038
no 0 2598
> polychor(pregnant)
[1] 0.9387769
> Yule(pregnant)
[1] 1

> phi(pregnant)

[1] 0.0371

> wkappa(pregnant)

$kappa
[1] 0.002744388

Table 4.16 Alternative measures of co-occurrence for binary data (adapted from Jackson et al. (1989);
Warrens (2008)).

Coefficient      Index                 Reference
Jaccard          a/(a+b+c)             Jaccard (1901)
Sorenson-Dice    2a/(2a+b+c)           ??
Russell-Rao      a/(a+b+c+d)           ?
Sokal            (a+b)/(2a+b+c)        ?
Ochiai           a/√((a+b)(a+c))       ?
etc.

the test correctly diagnoses 99 of them; and out of 100 people who do not have HIV/AIDS, the
test correctly returns a negative result for 99 of them. Assume that 1% of a sample of 10,000
people are truly infected with HIV/AIDS. What percentage of the sample will test positive? What
is the likelihood of having the disease if the test returns positive? The answer is that roughly
2% of the sample tests positive, and half of those are false positives.

Table 4.17 High specificity and sensitivity do not necessarily imply a low rate of false positives or
negatives if the base rates are extreme. Even with specificities and sensitivities of 99%, 50% of those
diagnosed positive are false positives.

                 HIV/AIDS Yes   HIV/AIDS No    total
Test Positive          99            99          198
Test Negative           1          9801         9802
total                 100         9,900       10,000

4.6.2 Measures of association for categorical data

Some categorical judgments are made using more than two outcomes. For example, two
diagnosticians might be asked to categorize patients three ways (e.g., Personality disorder,
Neurosis, Psychosis) or to rate the severity of a tumor (not present, present but benign,
serious, very serious). Just as base rates affect observed cell frequencies in a two by two table,
they need to be considered in the n-way table (Cohen, 1960). Consider Table 4.18, which is
adapted from Cohen (1968). Let O be the matrix of observed frequencies and E = RC where
R and C are the row and column (marginal) frequencies. Kappa corrects the proportion of
matches, p_o (the number of times the judges agree), with what would be expected by chance,
p_e (the sum of the diagonal of the product of the row and column frequencies). Thus
$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{tr(O - E)}{1 - tr(E)}. \qquad (4.38)$$

Table 4.18 Cohen's kappa measures agreement for n-way tables for two judges. It compares the
observed frequencies on the main diagonal with those expected from the marginals. The expected
proportions (shown in parentheses) are the products of the marginals. κ = [(.44+.20+.06) − (.30+.09+.02)] / [1 − (.30+.09+.02)] = .49.
Adapted from Cohen (1968).

                                                 Judge 1
                          Personality disorder   Neurosis    Psychosis    p_i.
        Personality disorder   .44 (.30)         .07 (.18)   .09 (.12)    .60
Judge 2 Neurosis               .05 (.15)         .20 (.09)   .05 (.06)    .30
        Psychosis              .01 (.05)         .03 (.03)   .06 (.02)    .10
        p_.j                     .50               .30         .20       1.00

As discussed by Hubert (1977) and Zwick (1988), kappa is one of a family of statistics that
correct observed agreement for expected agreement. If raters are assumed to be random
samples from a pool of raters, then the marginal probabilities for each rater may be averaged
and the expected values will be the squared marginals (Scott, 1955). Kappa does not assume
equal marginal frequencies and follows the χ² logic of finding expectancies based upon the
product of the marginal frequencies. However, kappa considers all disagreements to be equally
important and considers only entries on the main diagonal.
If some disagreements are more important than others, then the appropriate measure is
weighted kappa (Cohen, 1968):
$$\kappa_w = \frac{wp_o - wp_e}{1 - wp_e} \qquad (4.39)$$
where wp_o = Σ w_ij p_o_ij and similarly wp_e = Σ w_ij p_e_ij. With the addition of a weighting function,
w_ij, that weights the diagonal 1 and the off-diagonal cells with weights depending upon the inverse
of the squared distance between the categories, the weighted kappa coefficient is equivalent
to one form of the intraclass correlation coefficient (see 4.6.3) (Fleiss and Cohen, 1973).
Weighted kappa is particularly appropriate when the categories are ordinal and a near miss
is less important than a big miss (i.e., having one judge give a medical severity rating of not
present and the other judge rating the same case as very serious shows less agreement than
one judge giving a very serious and the other a serious rating). Weighted kappa is also useful
when the categories are not ordinal but some mistakes are more important than others.
The variance of kappa or weighted kappa for large samples may be found using formulas
in Fleiss et al. (1969). By using the resulting standard errors, it is possible to find confidence
intervals for kappa and weighted kappa (Hubert, 1977; Fleiss et al., 1969). Calculations of
kappa and weighted kappa are done in several packages: Kappa in vcd (a very nice package
for Visualizing Categorical Data), wkappa in psy, and cohen.kappa in psych.
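As a minimal base-R sketch of Equation 4.38, kappa can also be computed directly from the proportions of Table 4.18 (the packaged functions just listed do the same, and more, from raw ratings):

O <- matrix(c(.44, .07, .09,
              .05, .20, .05,
              .01, .03, .06), nrow = 3, byrow = TRUE)   #observed proportions from Table 4.18
E <- rowSums(O) %o% colSums(O)                          #expected proportions from the marginals
po <- sum(diag(O))                                      #observed agreement = .70
pe <- sum(diag(E))                                      #chance agreement = .41
(po - pe) / (1 - pe)                                    #kappa = .49, as in Table 4.18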

4.6.3 Intraclass Correlation

The Pearson correlation coefficient measures similarity of patterns of two distinct variables
across people. The variables are two measures (say height and weight) on the same set of
people, and the two variables are logically distinct. But sometimes it is desired to measure
how similar pairs (or more) of people are on one variable. Consider the problem of the similarity
of pairs of twins on a measure of ability (Table 4.19). For five pairs of twins, they may be
assigned to be the first or second twin based upon observed score (Twin 1 and Twin 2), or
as they are sampled (Twin 1* and Twin 2*). The correlation between the twins in the first

Table 4.19 Hypothetical twin data. The Twin 1 and Twin 2 columns have been ordered by the value
of the lower scoring twin; the Twin 1* and Twin 2* columns suggest what happens if the twins are
randomly assigned to twin number.

Pair   Twin 1   Twin 2   Twin 1*   Twin 2*
  1       80       90        80        90
  2       90      100       100        90
  3      100      110       110       100
  4      110      120       110       120
  5      120      130       130       120

two columns is 1, but between the second two sets of columns it is .80. That the twins in
the first two columns are not perfectly similar is obvious, in that their scores systematically
differ by 10 points. The normal correlation, by removing the means of the scores, does not
detect this effect. One solution, sometimes seen in the early behavior genetics literature, was
to double enter each twin, that is, to have each twin appear once in the first column and
once in the second column. This effectively pools the means of the two sets of twins and finds
the correlation with respect to deviations from this pooled mean. The value in this case is
.77. Why do these three correlations differ, and what is the correct value of the similarity
of the twins?
The answer comes from the intraclass correlation coefficient, or ICC, and a consideration
of the sources of variance going into the twin scores. Consider the traditional analysis of
variance model:
$$x_{ij} = \mu + a_i + b_j + (ab)_{ij} + e_{ij}$$
where μ is the overall mean for all twins, a_i is the effect of the ith pair, b_j is the effect of being
in the first or second column, (ab)_{ij} reflects the interaction of a particular twin pair with being in
column 1 or 2, and e_{ij} is residual error. In the case of twins, (ab)_{ij} and e_{ij} are indistinguishable
and may be combined as w_{ij}. Then the total variance σ_t² may be decomposed as
$$\sigma_t^2 = \sigma_i^2 + \sigma_j^2 + \sigma_w^2$$
and the fraction of the total variance (between plus within pair variance) due to differences between
the twin pairs is the intraclass correlation measure of similarity:
$$\rho = \frac{\sigma_i^2}{\sigma_t^2} = \frac{\sigma_i^2}{\sigma_i^2 + \sigma_j^2 + \sigma_w^2}.$$

An equivalent problem is the problem of estimating the agreement in their ratings between
two or more raters. Consider the ratings of six targets by four raters shown in Table 4.20.
Although the raters have an average intercorrelation of .76, they differ drastically in their
mean ratings and one (rater 4) has a much higher variance. As reviewed by Shrout and Fleiss
(1979) there are at least six different intraclass correlations that are commonly used when
considering the agreement between k different judges (raters) of n different targets:
1. Case 1: Targets are randomly assigned to different judges. (This would be equivalent to
the twins case above).
2. Case 2: All targets are rated by the same set of randomly chosen judges.
3. Case 3: All targets are rated by the same set of fixed judges.
4. Case 1-b: The expected correlation across targets of the mean ratings of randomly assigned
judges with another set of such mean ratings.
5. Case 2-b: The expected correlation of the average ratings across targets from one set of
randomly chosen judges with another set.
6. Case 3-b: The expected correlation of the average ratings across targets of fixed judges.
All six of these intraclass correlations may be estimated by standard analysis of variance
implemented in the ICC function in psych. If the ratings are numerical rather than categorical,
the ICC is to be preferred to κ or weighted κ which were discussed above (4.6.2).

Table 4.20 Example data from four raters for six targets demonstrate the use of the intraclass cor-
relation coefficient. Adapted from Shrout and Fleiss (1979)

Subject Rater 1 Rater 2 Rater 3 Rater 4


1 9 2 5 8
2 6 1 3 2
3 8 4 6 8
4 7 1 2 6
5 10 5 6 9
6 6 2 4 7

> round((sum(cor(SF79)) - 4)/12,2)

[1] 0.76

> describe(SF79)

var n mean sd median trimmed mad min max range skew kurtosis se
V1 1 6 7.67 1.63 7.5 7.67 2.22 6 10 4 0.21 -1.86 0.67
V2 2 6 2.50 1.64 2.0 2.50 1.48 1 5 4 0.45 -1.76 0.67
V3 3 6 4.33 1.63 4.5 4.33 2.22 2 6 4 -0.21 -1.86 0.67
V4 4 6 6.67 2.50 7.5 6.67 1.48 2 9 7 -0.90 -0.83 1.02

> ICC(SF79)

ICC1 ICC2 ICC3 ICC12 ICC22 ICC32


[1,] 0.17 0.29 0.71 0.44 0.62 0.91
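The calls shown above can be reproduced by rebuilding the ratings from Table 4.20 (a sketch assuming the psych package; the data frame name SF79 simply matches the one used in the table):

library(psych)
SF79 <- data.frame(V1 = c(9, 6, 8, 7, 10, 6),      #Rater 1
                   V2 = c(2, 1, 4, 1, 5, 2),       #Rater 2
                   V3 = c(5, 3, 6, 2, 6, 4),       #Rater 3
                   V4 = c(8, 2, 8, 6, 9, 7))       #Rater 4
describe(SF79)   #rater means and sds differ considerably
ICC(SF79)        #the six Shrout and Fleiss (1979) intraclass correlations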

4.6.4 Quantile Regression

Originally introduced by Galton (1889), regression of deviations from the median in terms
of quantile units has been rediscovered in the past decade (Gilchrist, 2005). The quantreg
package by Koenker (2007) implements these procedures.

4.6.5 Kendall’s Tau

τ is a rank order correlation based on the number of concordant (same rank order) and
discordant (different rank order) pairs (Dalgaard, 2002). If there are no ties in the ranks
of the x_i and y_i,
$$\tau = \frac{\sum_{i<j} sign(x_j - x_i)\, sign(y_j - y_i)}{n(n-1)/2}.$$
τ compares the number of concordant pairs to the number of discordant pairs, relative to the
total number of pairs. If two vectors, x and y, are monotonically related in the same direction,
τ will be one. Kendall's τ is available as an option (method="kendall") in the cor function in
base R and is also available as the Kendall function in the Kendall package.
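For example, in base R (the data here are simulated just to show the calls):

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
cor(x, y, method = "kendall")    #Kendall's tau
cor(x, y, method = "spearman")   #Spearman's rank-order correlation, for comparison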

4.6.6 Circular-circular and circular-linear correlations

As discussed earlier (3.4.1), when data represent angles (such as the hours of peak alertness
or peak tension during the day), we need to apply circular statistics rather than the more
familiar linear statistics (see Jammalamadaka and Lund (2006) for a very clear set of examples
of circular statistics). The generalization of the Pearson correlation to circular statistics
is straightforward and is implemented in cor.circular in the circular package and in
circadian.cor in the psych package. Just as the Pearson r is a ratio of covariance to the square
root of the product of two variances, so is the circular correlation. The circular covariance
of two circular vectors is defined as the average product of the sines of the deviations from
the circular mean. The variance is thus the average squared sine of the angular deviations
from the circular mean.
Consider the data shown in Table 3.8. Although the Pearson correlations of these variables range
from -.78 to .06, the circular correlations among all of them are exactly 1.0. (The separate
columns are just phase shifted by 5 hours and thus the deviations from the circular means
are identical.)
In addition to the circular-circular correlation, there is also the correlation between a
circular variable and a linear variable (the circular-linear correlation). The circular-linear
covariance is the product of the sine of the angular deviation from the circular mean times
the deviation of the linear variable from its mean. It may be found by the cor.circular
or the circadian.linear.cor functions. In the example in Table 4.21, the circular variable
of hour of peak mood for Tense Arousal has a perfect positive circular-linear correlation
with the linear variable, Extraversion, and a slight positive correlation with Neuroticism. By
comparison, the traditional Pearson correlations for these variables were -.78 and -.18.

4.7 Alternative estimates of effect size

There are a number of ways to summarize the importance of a relationship. The slope of the
linear regression, b_y.x, is a direct measure of how much one variable changes as a function of
changes in the other variable. The regression is in the units of the measures. Thus, Galton
could say that the height of children increased by .65 inches (or centimeters) for every increase
of 1 inch (or centimeter) in the mid-parent height. As a measure of effect with meaningful units,
the slope is probably the most interpretable.
But for much psychological data the units are arbitrary, and discussing scores in
terms of deviations from the mean with respect to the standard deviation (i.e., standard
scores) is more appropriate. In this case, the correlation is the preferred measure of effect. Using
correlations, Galton would have said that the relationship between mid-parent and child
height was .46. Experimentalists tend to think of effects in terms of differences between
the means of two groups; the two standard estimates of effect size for group differences are
Cohen's d (Cohen, 1988) and Hedges' g (Hedges and Olkin, 1985), both of which compare
the mean difference to an estimate of the within cell standard deviation. Cohen's d uses the
population estimate, Hedges' g the sample estimate. Useful reviews of the use of these and
other ways of estimating effect sizes for meta-analysis include Rosnow et al. (2000) and the
special issue of Psychological Methods devoted to effect sizes (Becker, 2003). Summaries of
these various formulae are in Table 4.22.

Table 4.21 The Pearson correlation for circular data misrepresents the relationships between data
that have a circular order (such as time of day, week, or year). The circular correlation considers the
sine of deviations from the circular mean. The correlation between a linear variable (e.g., extraversion
or neuroticism) with a circular variable is found using the circular-linear correlation.

> time.person #the raw data (four circular variables, two linear variables)

EA PA TA NegA extraversion neuroticism


1 9 14 19 24 1 3
2 11 16 21 2 2 6
3 13 18 23 4 3 1
4 15 20 1 6 4 4
5 17 22 3 8 5 5
6 19 24 5 10 6 2

> round(cor(time.person),2) #the Pearson correlations

EA PA TA NegA extraversion neuroticism


EA 1.00 1.00 -0.78 -0.34 1.00 -0.14
PA 1.00 1.00 -0.78 -0.34 1.00 -0.14
TA -0.78 -0.78 1.00 0.06 -0.78 -0.18
NegA -0.34 -0.34 0.06 1.00 -0.34 -0.23
extraversion 1.00 1.00 -0.78 -0.34 1.00 -0.14
neuroticism -0.14 -0.14 -0.18 -0.23 -0.14 1.00

> circadian.cor(time.person[1:4]) # the circular correlations


> round(circadian.linear.cor(time.person[1:4],time.person[5:6]),2)

EA PA TA NegA
EA 1 1 1 1
PA 1 1 1 1
TA 1 1 1 1
NegA 1 1 1 1

extraversion neuroticism
EA 1 0.18
PA 1 0.18
TA 1 0.18
NegA 1 0.18
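The output above can be reproduced by rebuilding the small data set from the table (a sketch assuming the psych package is loaded):

library(psych)
time.person <- data.frame(       #four circular (hours of peak mood) and two linear variables
  EA   = c(9, 11, 13, 15, 17, 19),
  PA   = c(14, 16, 18, 20, 22, 24),
  TA   = c(19, 21, 23, 1, 3, 5),
  NegA = c(24, 2, 4, 6, 8, 10),
  extraversion = 1:6,
  neuroticism  = c(3, 6, 1, 4, 5, 2))
round(cor(time.person), 2)                                           #the misleading Pearson correlations
circadian.cor(time.person[1:4])                                      #circular-circular correlations
round(circadian.linear.cor(time.person[1:4], time.person[5:6]), 2)   #circular-linear correlations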

4.8 Sources of confusion

The correlation coefficient, while an extremely useful measure of the relationship between
two variables, can sometimes lead to improper conclusions. Several of these are discussed in
more detail below. One of the most common problems is restriction of range of one of the
two variables. The use of sums, ratios, or differences can also lead to spurious correlations
when none are truly present. Some investigators will ipsatize scores, either intentionally or
unintentionally, and discover that correlations of related constructs are seriously reduced.
Simpson's paradox is a case of correlations between data measured at one level being reversed
when the data are pooled across a grouping variable at a different level. The practical importance
of a correlation is also frequently dismissed by reflexively squaring the correlation to express
the proportion of variance accounted for. In practical decision making situations, the slope of
the linear relationship between two variables is much more important than the squared correlation,
and it is more appropriate to consider the slope of the mean differences between groups
(Lubinski and Humphreys, 1996). Finally, correlations can be seriously attenuated by differences
in skew between different sets of variables.

Table 4.22 Alternative estimates of effect size. Using the correlation as a scale-free estimate of effect
size allows experimental and correlational data to be combined in a metric that is directly interpretable
as the effect of a standardized unit change in x leading to r change in standardized y.

Statistic            Estimate                           r in terms of the statistic                  Statistic in terms of r
Regression           b_y.x = C_xy / σ_x²                                                             b_y.x = r σ_y / σ_x
Pearson correlation  r_xy = C_xy / (σ_x σ_y)            r_xy
Cohen's d            d = (X̄_1 − X̄_2) / σ_x              r = d / √(d² + 4)                            d = 2r / √(1 − r²)
Hedges' g            g = (X̄_1 − X̄_2) / s_x              r = g / √(g² + 4(df/N))
t-test               t = d √(df) / 2                    r = √(t² / (t² + df))                        t = √(r² df / (1 − r²))
F-test               F = d² df / 4                      r = √(F / (F + df))                          F = r² df / (1 − r²)
Chi square                                              r = √(χ² / n)                                χ² = r² n
Odds ratio           d = ln(OR) / 1.81                  r = (ln(OR)/1.81) / √((ln(OR)/1.81)² + 4)    ln(OR) = 3.62 r / √(1 − r²)
r_equivalent         r with probability p               r = r_equivalent
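As a sketch of how such conversions can be used in practice, coded directly from the formulas in Table 4.22 (base R; the function names are ad hoc):

d2r   <- function(d) d / sqrt(d^2 + 4)             #Cohen's d to r
r2d   <- function(r) 2 * r / sqrt(1 - r^2)         #r to Cohen's d
t2r   <- function(t, df) sqrt(t^2 / (t^2 + df))    #t statistic to r
chi2r <- function(chi2, n) sqrt(chi2 / n)          #chi square to r
d2r(.5)          #a d of .5 corresponds to an r of about .24
r2d(d2r(.5))     #converting back recovers .5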


4.8.1 Restriction of range

The correlation is a ratio of the covariance to the square root of the product of two variances
(4.8). As such, if the variance of the predictor is artificially constrained, the correlation will be
reduced, even though the slope of the regression remains the same. Consider an example of
1,000 simulated students with GRE Verbal and GRE Quantitative scores with a population
correlation of .6. If the sample is restricted in its variance (say only students with GREV > 600
are allowed to apply), the correlation drops by almost half, from .61 to .34 (Table 4.23, Figure 4.9).
An even more serious problem occurs if the range is restricted based upon the sum of the
two variables. This might be the case if an admissions committee based its decisions upon
total GRE scores and then examined the correlation between its predictors. Consider the
correlation within those applicants who had total scores of more than 1400. In this case, the
correlation for these 11 hypothetical subjects becomes -.34 even though the underlying
correlation was .61! Similar problems will occur when choosing a high group based upon
several measures of a related construct. Some researchers examine the relationships among
measures of negative affect within a group chosen to be extreme on the trait: that is,
what is the correlation between measures of neuroticism, anxiety, and depression within a
selected set of patients rather than in the general population? Consider the data set epi.bfi,
which includes measures of Neuroticism using the Eysenck Personality Inventory (Eysenck
and Eysenck, 1964), of Depression using the Beck Depression Inventory (Beck et al., 1961),
and of Trait Anxiety using the State Trait Anxiety Inventory (Spielberger et al., 1970) for 231
undergraduates. For the total sample, these three measures have correlations of .53, .73 and
.65, but if a broad trait of negative affectivity is defined as the sum of the three standardized
scales and an "at risk" group is chosen as those more than 1 s.d. above the mean on this
composite, the correlations become -.08, -.11, and .17.

Table 4.23 Restricting the range of one variable will reduce the correlation, although not change the
regression slope. Simulated data are generated using the mvrnorm function from the MASS package.
To give the example some context, the variables may be taken to represent “GRE Verbal” and “GRE
Quantitative”. The results are shown in Figure 4.9.

library(MASS)
set.seed(42)
GRE <- data.frame(mvrnorm(1000,c(500,500),matrix(c(10000,6000,6000,10000),ncol=2)))
colnames(GRE) <- c("GRE.V","GRE.Q")
op <- par(mfrow = c(1,2))
plot(GRE,xlim=c(200,800),ylim=c(200,800),main="Unrestricted")
lmc <- lm(GRE.Q ~ GRE.V,data=GRE)
abline(lmc)
text(700,200,paste("r =",round(cor(GRE)[1,2],2)))
text(700,250,paste("b =",round(lmc$coefficients[2],2)))
GREs <- subset(GRE,GRE$GRE.V > 600)
plot(GREs,xlim=c(200,800),ylim=c(200,800),main="Range restricted")
lmc <- lm(GRE.Q ~ GRE.V,data=GREs)
abline(lmc)
text(700,200,paste("r =",round(cor(GREs)[1,2],2)))
text(700,250,paste("b =",round(lmc$coefficients[2],2)))

[Figure 4.9 appears here: scatterplots of GRE.Q against GRE.V. Unrestricted sample: b = 0.61, r = 0.61. Range restricted sample (GRE.V > 600): b = 0.54, r = 0.33.]

Fig. 4.9 The effect of restriction of range on correlation and regression. If the predictor (X) variable is
restricted by selection, the regression slope does not change, but the correlation drops. Data generated
using mvrnorm as shown in Table 4.23.
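Continuing the simulation of Table 4.23, a short sketch of selection on the sum of the two predictors (the GRE data frame is the one created there):

GREtotal <- subset(GRE, GRE.V + GRE.Q > 1400)   #an "admitted" group selected on the total score
nrow(GREtotal)                                  #only a handful of applicants remain
cor(GREtotal)[1, 2]                             #the within-group correlation is now negative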

4.8.2 Spurious correlations

Although he viewed the correlation coefficient as perhaps his greatest accomplishment, Pearson
(1910) listed a number of sources of spurious correlations (Aldrich, 1995). Among these were
the problems of ratios and of sums, and of correlations induced by mixing different groups.

4.8.2.1 The misuse of ratios, sums and differences

It is not uncommon to convert observations to ratios of an amount, X_i, with respect to some
common baseline, T. Consider the case of discretionary income spent on CDs, books, and
wine expressed with respect to total discretionary income, T. Although these four variables might
themselves be uncorrelated, expressing the first three as ratios of a common variable makes the
ratios spuriously correlated (Table 4.24).
Just as forming ratios can induce spurious correlations, so can addition or subtraction.
Table 4.24 also considers the difference between the amount spent on CDs, books, or wine and
total income. Even though the raw data are uncorrelated, the differences are correlated. The
reason is that 50% of the variance of each ratio or difference score is associated
with the common variable. Thus, the expected shared variance between two such ratios
or differences is 25%, for an expected correlation of about .5.

Table 4.24 When variables are expressed as ratios of, or differences from, an unrelated common variable,
the ratios and differences are correlated even though the original variables are not. The amount of
money spent on CDs, books, or wine is unrelated to the other two and to total income. But the ratio
of the amount spent on CDs, books, or wine to total income (CDsratio, etc.) as well as the
differences in amount spent (CDs - Income, etc.) are correlated.

> x <- matrix(rnorm(1000),ncol=4) + 4


> colnames(x) <- c("CDs","Books","Wine","Income")
> x.df <- data.frame(x,x/x[,4],(x-x[,4]))
> colnames(x.df) <- c("CDs","Books","Wine","Income",
+ paste(c("CDs","Books","Wine","Income"),"ratio",sep=""),
+ paste(c("CDs","Books","Wine","Income"),"diff",sep=""))
> round(cor(x.df[-c(8,12)]),2)

CDs Books Wine Income CDsratio Booksratio Wineratio CDsdiff Booksdiff Winediff
CDs 1.00 0.00 -0.04 0.02 0.64 -0.04 -0.06 0.71 -0.02 -0.05
Books 0.00 1.00 -0.01 -0.03 0.00 0.61 0.00 0.02 0.72 0.01
Wine -0.04 -0.01 1.00 0.04 -0.07 -0.05 0.58 -0.06 -0.04 0.69
Income 0.02 -0.03 0.04 1.00 -0.68 -0.73 -0.72 -0.69 -0.71 -0.70
CDsratio 0.64 0.00 -0.07 -0.68 1.00 0.55 0.52 0.94 0.47 0.44
Booksratio -0.04 0.61 -0.05 -0.73 0.55 1.00 0.59 0.49 0.94 0.50
Wineratio -0.06 0.00 0.58 -0.72 0.52 0.59 1.00 0.46 0.49 0.94
CDsdiff 0.71 0.02 -0.06 -0.69 0.94 0.49 0.46 1.00 0.49 0.46
Booksdiff -0.02 0.72 -0.04 -0.71 0.47 0.94 0.49 0.49 1.00 0.49
Winediff -0.05 0.01 0.69 -0.70 0.44 0.50 0.94 0.46 0.49 1.00

4.8.2.2 Correlation induced by ipsatization and other devices

When studying individual differences in values (e.g., Allport et al., 1960; Hinz et al., 2005), it
is typical to ipsatize the scores (Cattell, 1945). That is, the total score across all the values is fixed
at a constant for all participants, and an increase in one value necessarily implies a decrease in the
others. Essentially this is zero centering the data for each participant. Psychologically this
means everyone has the same total of value strength. Even for truly uncorrelated variables,
ipsatization forces a correlation of -1/(k - 1) for k variables and reduces the rank of the
correlation matrix by 1 (Dunlap and Cornwell, 1994).
This problem can occur in more subtle ways than just fixing the sum to be a constant.
Sometimes ratings are made on a forced choice basis (choose which behavior is being shown),
which can lead to the strange conclusion that, e.g., being friendly is unrelated to being sociable.
When the ratings are not forced choices but rather the amount of each behavior is
rated separately, the normal structure is observed (Romer and Revelle, 1984). This problem
can also be seen in cognitive psychology when raters are asked to choose which cognitive
process is being used, rather than rating how much each process is being used.
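A minimal sketch of the Dunlap and Cornwell (1994) result (base R; the choice of k = 5 and the simulated data are arbitrary):

set.seed(42)
k <- 5
raw <- matrix(rnorm(1000 * k), ncol = k)   #k truly uncorrelated variables
ipsatized <- raw - rowMeans(raw)           #ipsatize: zero center each person (row)
round(cor(ipsatized), 2)                   #off-diagonal values are near -1/(k-1) = -.25
qr(cor(ipsatized))$rank                    #the correlation matrix now has rank k - 1 = 4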

4.8.2.3 Simpson’s Paradox and the within versus between correlation problem

The confounding of group differences with within-group relationships to produce spurious
overall relationships has plagued the use of correlation since it was developed. Consider
the classic example of inappropriately deciding that an antitoxin is effective even though in
reality it has no effect (Yule, 1912). If women have a higher mortality from a disease than
do men, but more men are given the antitoxin than women, the pooled data will show
a favorable effect of the antitoxin, even though it in fact had no effect. Similarly, between
1974 and 1978 the tax rate decreased within each of several income categories, although the
overall tax rate increased (Wagner, 1982). So-called ecological correlations (Robinson, 1950)
are correlations of group means and may or may not reflect relationships within groups.
One of the best known examples of this effect, Simpson's paradox, in which
relationships within groups can be in the opposite direction of the relationships for the entire
sample (Simpson, 1951), was found when studying graduate admissions to the University of
California, Berkeley. In 1973, UCB had 2691 male applicants and 1198 female applicants. Of
the males, about 44% were admitted; of the females, about 35%. What seemed to be obvious
sex discrimination in admissions became a paper in Science when it was discovered that the
individual departments, if discriminating at all, discriminated in favor of women (Bickel et al.,
1975). The women were applying to the departments which admitted a smaller percentage
of their applicants (i.e., two thirds of the applicants to English but only 2 percent of those to
mechanical engineering were women). The correlation across departments between the percentage
of female applicants and the difficulty of admission was .56. This data set (UCBAdmissions) is used
as an example of various graphical displays.
Problems similar to the UCB case can arise when pooling within-subject effects across
subjects. For instance, when examining the structure of affect, the structure across subjects
is very different from the structure within subjects. Across subjects, positive and negative
affect are almost independent, while within subjects the correlation reliably varies from
highly positive to highly negative (Rafaeli et al., 2007).

4.8.2.4 Correlations of means ≠ correlations of observations

There are other cases, however, when the correlation of group means clarifies the importance
of the underlying relationship (Lubinski and Humphreys, 1996).

4.8.2.5 Base rates and skew

A difficulty with the Pearson correlation (but not with rank order correlations such as Spearman's
ρ) arises when the variables differ in their amount of skew. This problem occurs, for example, when
examining the structure of measures of positive and negative affect (Rafaeli and Revelle,
2006), or when looking for correlates of psychopathology. The Pearson r can be greatly
attenuated by large differences in skew; correlations of rank orders (i.e., Spearman's rho)
do not suffer from this problem. Consider a simple example of two bivariate normal variables,
x and y, with a population correlation of .71, together with various transformations of these
original data that induce positive and negative skew: log.x and log.y are negatively skewed,
-log(-x) (log.nx) and -log(-y) (log.ny) have positive skew, the exponentials of x (exp.x) and
y (exp.y) have very large positive skews, and their negated inverses (exp.nx = −e^{−x}, exp.ny = −e^{−y})
have large negative skews (Table 4.25, Figure 4.10).

Fig. 4.10 Differences in skew can attenuate correlations. Two variables, x and y, are correlated .71 in
the bivariate normal case. Four variations of each of x and y are generated by log and exponential
transformations of the original or reversed values to induce positive and negative skew.

Table 4.25 Descriptive statistics of skew example


> describe(skew.df)
var n mean sd median trimmed mad min max range skew kurtosis se
x 1 1000 0.00 1.03 0.01 0.00 0.96 -3.20 3.47 6.67 -0.05 0.07 0.03
log.x 2 1000 1.35 0.29 1.39 1.37 0.24 -0.22 2.01 2.23 -1.15 2.73 0.01
log.nx 3 1000 -1.35 0.29 -1.38 -1.37 0.24 -1.97 0.64 2.61 1.10 3.23 0.01
exp.x 4 1000 91.93 122.03 55.22 67.64 49.38 2.23 1756.20 1753.98 4.94 43.58 3.86
exp.nx 5 1000 -93.82 127.51 -53.98 -67.54 47.03 -1338.84 -1.70 1337.14 -4.34 27.74 4.03
y 6 1000 -0.02 0.99 -0.04 -0.03 1.02 -3.68 2.74 6.42 0.04 -0.08 0.03
log.y 7 1000 1.35 0.28 1.38 1.37 0.25 -1.14 1.91 3.05 -1.36 7.12 0.01
log.ny 8 1000 -1.36 0.27 -1.40 -1.38 0.25 -2.04 -0.23 1.81 0.91 1.40 0.01
exp.y 9 1000 87.51 106.14 52.40 65.79 45.79 1.38 844.76 843.38 3.39 15.45 3.36
exp.ny 10 1000 -90.05 119.15 -56.89 -69.21 49.25 -2167.17 -3.53 2163.64 -7.40 103.09 3.77

Examining the SPLOM, it is clear that small differences in skew do not lead to much
attenuation, but that as the differences in skew increase, particularly if they are in opposite
directions, the correlations are seriously attenuated. This is true not just among the various
transformations of x (or of y) themselves, but is even more serious for the correlations between
the transformed values of x and those of y. For this particular example, because the transformations
were monotonic, the Spearman rho correlations were correctly 1.0 within the x and y sets, and
.69 between them.
When working with only a few levels rather than the many shown in Figure 4.10, the
problems of skew are also known as problems of base rates. If the probability of success on
a task is much greater than the probability of failure, and the probability of a predictor of
success being positive is much less than the probability of it being negative, then the dichotomous
variable of success/failure cannot have a high correlation with the predictor, even if the
underlying relationship were perfect.

4.8.3 Non-linearity, outliers and other problems: the importance of graphics

Although not all threats to inference can be detected graphically, one of the most powerful
statistical tests for non-linearity and outliers is the well known but not often used "interocular
trauma test". A classic example of the need to examine one's data for the effects of non-linearity
and outliers is the set of examples constructed by Anscombe (1973), which is included in R as the
anscombe data set. These data are striking because they show four patterns of results with
equal regressions and equal descriptive statistics. The graphs differ drastically in appearance:
one shows a curvilinear relationship, two have one extreme score, and one shows
the expected pattern. Anscombe's discussion of the importance of graphs is just as timely
now as it was 35 years ago:
Graphs can have various purposes, such as (i) to help us perceive and appreciate some broad
features of the data, (ii) to let us look behind these broad features and see what else is there.
Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those
assumptions may be false, and the calculations may be misleading. We ought always to try to
check whether the assumptions are reasonably correct; and if they are wrong we ought to be able
to perceive in what ways they are wrong. Graphs are very valuable for these purposes. (Anscombe,
1973, p 17).
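The anscombe data set makes the point directly; a brief sketch in base R:

data(anscombe)
summary(lm(y1 ~ x1, data = anscombe))$coefficients   #intercept about 3.0, slope about 0.5
summary(lm(y2 ~ x2, data = anscombe))$coefficients   #essentially the same fit
#the same near-identical coefficients hold for all four x-y pairs, yet
#plot(anscombe$x2, anscombe$y2) shows a clearly curvilinear relationship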

The next chapter will generalize the correlation coefficient from the case of two variables to
the case of multiple predictors (multiple R) and to the problem of statistically controlling for one
or more variables when considering the relationship between two others. The problems that
arise in the two variable case are even more pronounced in the multiple variable case.
Chapter 5
Multiple correlation and multiple regression

The previous chapter considered how to determine the relationship between two variables
and how to predict one from the other. The general solution was to consider the ratio of
the covariance between two variables to the variance of the predictor variable (regression)
or the ratio of the covariance to the square root of the product of the variances (correlation).
This solution may be generalized to the problem of how to predict a single variable from the
weighted linear sum of multiple variables (multiple regression) or to measure the strength of
this relationship (multiple correlation). As part of the problem of finding the weights, the
concepts of partial covariance and partial correlation will be introduced. To do all of this will
require finding the variance of a composite score and the covariance of this composite with
another score, which might itself be a composite.
Much of psychometric theory is merely an extension, an elaboration, or a generalization
of these concepts. Almost all tests are composites of items or subtests. An understanding of
how to decompose test variance into its component parts, and conversely, of how to analyze
tests as composites of items, allows us to analyze the meaning of tests. But tests are not merely
composites of items. Tests relate to other tests. A deep appreciation of the basic Pearson
correlation coefficient facilitates an understanding of its generalization to multiple and partial
correlation, to factor analysis, and to questions of validity.

5.1 The variance of composites

If x_1 and x_2 are vectors of N observations centered around their means (that is, deviation
scores), their variances are V_x1 = Σx_i1²/(N − 1) and V_x2 = Σx_i2²/(N − 1), or, in matrix terms,
V_x1 = x_1'x_1/(N − 1) and V_x2 = x_2'x_2/(N − 1). The variance of the composite made up of the sum
of the corresponding scores, x + y, is just
$$V_{(x+y)} = \frac{\sum(x_i + y_i)^2}{N - 1} = \frac{\sum x_i^2 + \sum y_i^2 + 2\sum x_i y_i}{N - 1} = \frac{(x + y)'(x + y)}{N - 1}. \qquad (5.1)$$
Generalizing 5.1 to the case of n xs, the composite matrix of these is just $_N X_n$ with dimensions
of N rows and n columns. The matrix of variances and covariances of the individual items of
this composite is written as S as it is a sample estimate of the population variance-covariance
matrix, Σ. It is perhaps helpful to view S in terms of its elements, n of which are variances


and n² − n = n(n − 1) are covariances:
$$S = \begin{pmatrix} v_{x_1} & c_{x_1x_2} & \cdots & c_{x_1x_n} \\ c_{x_1x_2} & v_{x_2} & \cdots & c_{x_2x_n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{x_1x_n} & c_{x_2x_n} & \cdots & v_{x_n} \end{pmatrix}$$
The diagonal of S, diag(S), is just the vector of individual variances. The trace of S is
the sum of the diagonal elements and will be used a great deal when considering how to estimate
reliability. It is convenient to represent the variance of the composite as the sum of all of the
elements in the matrix S:
$$V_X = \frac{\sum X'X}{N - 1} = \frac{1'(X'X)1}{N - 1}.$$
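A quick numerical check of this identity (base R, with arbitrary simulated items):

set.seed(42)
X <- matrix(rnorm(100 * 4), ncol = 4)   #four hypothetical items for 100 cases
var(rowSums(X))                         #variance of the composite (the sum of the items)
sum(cov(X))                             #the sum of all elements of S: identical to the line above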

5.2 Multiple regression

The problem of the optimal linear prediction of ŷ in terms of x may be generalized to the
problem of linearly predicting ŷ in terms of a composite variable X made up of individual
variables x_1, x_2, ..., x_n. Just as b_y.x = cov_xy / var_x is the optimal slope for predicting
y, so it is possible to find a set of weights (β weights in the standardized case, b weights in
the unstandardized case) for each of the individual x_i.
Consider first the problem of two predictors, x_1 and x_2. We want to find the weights,
b_i, that when multiplied by x_1 and x_2 maximize the covariance with y. That is, we want to
solve the two simultaneous equations
$$\begin{cases} v_{x_1}b_1 + c_{x_1x_2}b_2 = c_{x_1y} \\ c_{x_1x_2}b_1 + v_{x_2}b_2 = c_{x_2y} \end{cases}$$
or, in the standardized case, find the β_i:
$$\begin{cases} \beta_1 + r_{x_1x_2}\beta_2 = r_{x_1y} \\ r_{x_1x_2}\beta_1 + \beta_2 = r_{x_2y} \end{cases} \qquad (5.2)$$

We can directly solve these two equations by adding and subtracting terms so that we end up
with a solution to the first in terms of β_1 and to the second in terms of β_2:
$$\begin{cases} \beta_1 = r_{x_1y} - r_{x_1x_2}\beta_2 \\ \beta_2 = r_{x_2y} - r_{x_1x_2}\beta_1 \end{cases} \qquad (5.3)$$
Substituting the second row of (5.3) into the first row, and vice versa, we find
$$\begin{cases} \beta_1 = r_{x_1y} - r_{x_1x_2}(r_{x_2y} - r_{x_1x_2}\beta_1) \\ \beta_2 = r_{x_2y} - r_{x_1x_2}(r_{x_1y} - r_{x_1x_2}\beta_2) \end{cases}$$
Collecting terms and rearranging:
$$\begin{cases} \beta_1 - r_{x_1x_2}^2\beta_1 = r_{x_1y} - r_{x_1x_2}r_{x_2y} \\ \beta_2 - r_{x_1x_2}^2\beta_2 = r_{x_2y} - r_{x_1x_2}r_{x_1y} \end{cases}$$
leads to
$$\begin{cases} \beta_1 = (r_{x_1y} - r_{x_1x_2}r_{x_2y})/(1 - r_{x_1x_2}^2) \\ \beta_2 = (r_{x_2y} - r_{x_1x_2}r_{x_1y})/(1 - r_{x_1x_2}^2) \end{cases} \qquad (5.4)$$
Alternatively, these two equations (5.2) may be represented as the product of a vector of
unknowns (the βs) and the matrix of correlations among the predictors, set equal to the vector
of correlations of the predictors with the criterion:
$$(\beta_1\ \beta_2)\begin{pmatrix} r_{x_1x_1} & r_{x_1x_2} \\ r_{x_1x_2} & r_{x_2x_2} \end{pmatrix} = (r_{x_1y}\ r_{x_2y}) \qquad (5.5)$$
If we let $\beta = (\beta_1\ \beta_2)$, $R = \begin{pmatrix} r_{x_1x_1} & r_{x_1x_2} \\ r_{x_1x_2} & r_{x_2x_2} \end{pmatrix}$ and $r_{xy} = (r_{x_1y}\ r_{x_2y})$, then equation 5.5 becomes

β R = rxy (5.6)

and we can solve Equation 5.6 for β by multiplying both sides by the inverse of R.

β = β RR−1 = rxy R−1 (5.7)

Similarly, if cxy represents the covariances of the xi with y, then the b weights may be found
by
b = cxy S−1
and thus, the predicted scores are

ŷ = β X = rxy R−1 X. (5.8)

The βi are the direct effects of the xi on y. The total effects of the xi on y are the correlations;
the indirect effects reflect the product of the correlations between the predictor variables and
the direct effects of each predictor variable.
Estimation of the b or β vectors, with many diagnostic statistics of the quality of the
regression, may be found using the lm function. When using categorical predictors, the linear
model is also known as analysis of variance, which may be done using the anova and aov functions.
When the outcome variable is dichotomous, logistic regression may be done using the generalized
linear model function glm with a binomial error family. A complete discussion of the power
of the generalized linear model is beyond any introductory text, and the interested reader
is referred to, e.g., Cohen et al. (2003); Dalgaard (2002); Fox (2008); Judd and McClelland
(1989); Venables and Ripley (2002).
Diagnostic tests of the regressions, including plots of the residuals versus estimated values,
tests of the normality of the residuals, and identification of highly weighted subjects, are available
as part of the graphics associated with the lm function.
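As a minimal sketch of this workflow (not from the original text), one could fit a model to the sat.act data set (used later in this chapter) and ask for the standard diagnostic graphics; the particular predictors chosen here are purely illustrative:

library(psych)
data(sat.act)
mod <- lm(ACT ~ age + education, data = sat.act)   # an illustrative linear model
summary(mod)                                        # b weights, standard errors, R^2
par(mfrow = c(2, 2))
plot(mod)        # residuals vs fitted, normal Q-Q, scale-location, and influence plots
par(mfrow = c(1, 1))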

5.2.1 Direct and indirect effects, suppression and other surprises

If the predictors xi, xj are uncorrelated, then each separate variable makes a unique contribution
to the dependent variable, y, and R², the amount of variance accounted for in y, is
the sum of the individual r². In that case, if each of 10 predictors accounted for just
10% of the variance of y, together they would leave no unexplained variance.
Unfortunately, most predictors are correlated, and the βs found in 5.5 or 5.7 are less than
the original correlations. Since

$$R^2 = \sum \beta_i r_{x_iy} = \beta r_{xy}'$$

the R² will not increase as much as it would if the predictors were less correlated or uncorrelated.
An interesting case that occurs infrequently, but is important to consider, is that of
suppression. A suppressor may not correlate with the criterion variable, but, because it does
correlate with the other predictor variables, it removes variance from those other predictor
variables (Nickerson, 2008; Paulhus et al., 2004). This has the effect of reducing the denominator
in Equation 5.4 and thus increasing the βi for the other variables. Consider the case of
two predictors of stock broker success: self reported need for achievement and self reported
anxiety (Table 5.1). Although Need Achievement has a modest correlation with success, and
Anxiety has none at all, adding Anxiety into the multiple regression increases the R² from
.09 to .12. An explanation for this particular effect might be that people who want to be stock
brokers are more likely to say that they have high Need Achievement. Some of this variance
is probably legitimate, but some might be due to a tendency to fake positive aspects.
Low anxiety scores could reflect a tendency to fake positive by denying negative aspects.
But those who are willing to report being anxious probably are anxious, and are telling the
truth. Thus, adding anxiety into the regression removes some misrepresentation from the
Need Achievement scores, and increases the multiple R.¹

5.2.2 Interactions and product terms: the need to center the data

In psychometric applications, the main use of regression is in predicting a single criterion
variable in terms of the linear sum of a predictor set. Sometimes, however, a more appropriate
model is to consider that some of the variables have multiplicative effects (i.e., interact) such
that the effect of x on y depends upon a third variable z. This can be examined by using the
product terms of x and z. But to do so, and to avoid problems of interpretation, it is first
necessary to zero center the predictors so that the product terms are not correlated with the
additive terms. The default values of the scale function will center as well as standardize
the scores. To just center a variable, x, use scale(x,scale=FALSE). This will preserve the
units of x. scale returns a matrix, but the lm function requires a data.frame as input. Thus,
it is necessary to convert the output of scale back into a data.frame.
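A minimal sketch of this centering step (not from the original text), using hypothetical variables x, z, and y, might look like the following; the lm call then includes the product term directly:

set.seed(42)
x <- rnorm(100, mean = 5)
z <- rnorm(100, mean = 3)
y <- x + z + x * z + rnorm(100)                               # hypothetical data with an interaction
centered <- data.frame(scale(cbind(x, z), scale = FALSE), y)  # center the predictors, keep y as is
mod <- lm(y ~ x * z, data = centered)                         # x*z expands to x + z + x:z
summary(mod)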
A detailed discussion of how to analyze and then plot data showing interactions between
experimental variables and subject variables (e.g., manipulated positive affect and extraver-
sion) or interactions of subject variables with each other (e.g., neuroticism and extraversion)

1 Although the correlation values are enhanced to show the effect, this particular example was observed
in a high stakes employment testing situation.

Table 5.1 An example of suppression is found when predicting stockbroker success from self report
measures of need for achievement and anxiety. By having a suppressor variable, anxiety, the multiple
R goes from .3 to .35.

> stock
> mat.regress(stock,c(1,2),3)

            Nach Anxiety Success
achievement  1.0    -0.5     0.3
Anxiety     -0.5     1.0     0.0
Success      0.3     0.0     1.0

$beta
Nach Anxiety
0.4 0.2
$R
Success
0.35
$R2
Success
0.12

[Figure 5.1 appears here: path diagrams for four regression cases, with panels labeled Independent Predictors, Correlated Predictors, Suppression, and Missing variable.]

Fig. 5.1 There are at least four basic regression cases: the independent predictor case, where the βi are the
same as the correlations; the normal, correlated predictor case, where the βi are found as in 5.7; the
case of suppression, where although a variable does not correlate with the criterion, because it does
correlate with a predictor, it will have a useful βi weight; and the case where the model is misspecified
and in fact a missing variable accounts for the correlations.

is beyond the scope of this text and is considered in great detail by Aiken and West (1991)
and Cohen et al. (2003), and in less detail in an online appendix to a chapter on exper-
imental approaches to personality Revelle (2007), http://personality-project.org/r/
simulating-personality.html. In that appendix, simulated data are created to show ad-
ditive and interactive effects. An example analysis examines the effect of Extraversion and
a movie induced mood on positive affect. The regression is done using the lm function on
the centered data (Table 5.2). The graphic display shows two regression lines, one for the
simulated “positive mood induction”, the other for a neutral induction.

Table 5.2 Linear model analysis of simulated data showing an interaction between the personality
dimension of extraversion and a movie based mood induction. Adapted from Revelle (2007).

> # a great deal of code to simulate the data


> mod1 <- lm(PosAffect ~ extraversion*reward,data = centered.affect.data) #look for interactions
> print(summary(mod1,digits=2))

Call:
lm(formula = PosAffect ~ extraversion * reward, data = centered.affect.data)

Residuals:
Min 1Q Median 3Q Max
-2.062 -0.464 0.083 0.445 2.044

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.8401 0.0957 -8.8 6e-14 ***
extraversion -0.0053 0.0935 -0.1 0.95
reward1 1.6894 0.1354 12.5 <2e-16 ***
extraversion:reward1 0.2529 0.1271 2.0 0.05 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 96 degrees of freedom


Multiple R-squared: 0.63, Adjusted R-squared: 0.62
F-statistic: 54 on 3 and 96 DF, p-value: <2e-16

5.2.3 Confidence intervals of the regression and regression weights

The multiple correlation finds weights to best fit the particular sample. Unfortunately, it is
a biased estimate of the population values. Consequently, the value of R² is likely to shrink
when applied to another sample. Standard estimates of the amount of shrinkage consider
the size of the sample as well as the number of variables in the model. For N subjects and k
predictors, the estimated (shrunken) R², $\tilde{R}^2$, is

$$\tilde{R}^2 = 1 - (1-R^2)\frac{N-1}{N-k-1}.$$

[Figure 5.2 appears here: "Simulated interaction", plotting Positive Affect against Extraversion with separate regression lines for the Positive and Neutral mood induction conditions.]

Fig. 5.2 The (simulated) effect of extraversion and movie induced mood on positive affect. Adapted
from Revelle (2007). Detailed code for plotting interaction graphs is available in the appendix.

The confidence interval of R² is, of course, a function of the variance of R², which (taken
from Cohen et al. (2003) and Olkin and Finn (1995)) is

$$SE^2_{R^2} = \frac{4R^2(1-R^2)^2(N-k-1)^2}{(N^2-1)(N+3)}.$$

Because multiple R partitions the observed variance into modeled and residual variance,
testing the hypothesis that the multiple R is zero may be done by analysis of variance
and leads to an F ratio with k and (N − k − 1) degrees of freedom:

$$F = \frac{R^2(N-k-1)}{(1-R^2)k}.$$

The standard error of the beta weights is

$$SE_{\beta_i} = \sqrt{\frac{1-R^2}{(N-k-1)(1-R_i^2)}}$$

where $R_i^2$ is the squared multiple correlation of $x_i$ with all the other $x_j$ variables. (This term, the squared
multiple correlation, is used in estimating communalities in factor analysis, see 6.2.1. It may
be found by the smc function.)
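The following sketch (not from the original text) collects these estimates for hypothetical values of N = 100 subjects, k = 3 predictors, and an observed R² of .20; it simply codes the formulas given above:

N <- 100; k <- 3; R2 <- .20
R2.adj <- 1 - (1 - R2) * (N - 1)/(N - k - 1)          # estimated (shrunken) R^2
se.R2  <- sqrt(4 * R2 * (1 - R2)^2 * (N - k - 1)^2 /
               ((N^2 - 1) * (N + 3)))                  # standard error of R^2
F.stat <- R2 * (N - k - 1)/((1 - R2) * k)              # test that multiple R = 0
p      <- pf(F.stat, k, N - k - 1, lower.tail = FALSE)
round(c(R2.adj = R2.adj, se.R2 = se.R2, F = F.stat, p = p), 3)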

5.2.4 Multiple regression from the covariance/correlation matrix

Using the raw data allows for error diagnostics and for the inclusion of interaction terms. But
since Equation 5.7 is expressed in terms of the correlation matrix, the regression weights can
be found from the correlation matrix. This is particularly useful if one does not have access
to the raw data (e.g., when reanalyzing a published study), or if the correlation matrix is
synthetically constructed. The function mat.regress allows one to extract subsets of vari-
ables (predictors and criteria) from a matrix of correlations and find the multiple correlations
and beta weights of the x set predicting each member of the y set.
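As a sketch of what such a function does internally (not the actual mat.regress code), Equation 5.7 can be applied by hand to the stock broker correlations of Table 5.1; the results match the beta weights and R² reported there:

Rx  <- matrix(c(1, -.5,
                -.5, 1), 2, 2)             # correlations among Nach and Anxiety
rxy <- c(.3, 0)                            # their correlations with Success
beta <- solve(Rx, rxy)                     # Equation 5.7
R2   <- sum(beta * rxy)
round(c(beta, R = sqrt(R2), R2 = R2), 2)   # 0.4, 0.2, .35, .12 as in Table 5.1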

5.2.5 The robust beauty of linear models

Although the β weights of Equation 5.7 are the optimal weights, it has been known since Wilks (1938) that
departures from optimal do not change the result very much. This has come to be called “the
robust beauty of linear models” (Dawes and Corrigan, 1974; Dawes, 1979) and follows the
principle of “it don’t make no nevermind” (Wainer, 1976). That is, for standardized variables
predicting a criterion with .25 < β < .75, setting all βi = .5 will reduce the accuracy of
prediction by no more than 1/96th. Thus the advice to standardize and add. (Clearly this
advice does not work for strong negative correlations, but in that case standardize and
subtract. In the general case, weights of -1, 0, or 1 are the robust alternative.)
A graphic demonstration of how a very small reduction in the R2 value can lead to an
infinite set of “fungible weights” that are all equally good in predicting the criterion is the
paper by Waller (2008) with associated R code. This paper reiterates the skepticism that
one should have for the interpretability of any particular pattern of β weights. A much fuller
treatment of the problem of interpreting differences in beta weights is found in the recent
chapter by Azen and Budescu (2009).
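A small sketch of this claim (not from the original text), using a hypothetical correlation matrix of three predictors and a criterion, compares the R² from the optimal weights with that from simple unit weights:

R   <- matrix(c(1, .56, .48,
                .56, 1, .42,
                .48, .42, 1), 3, 3)     # hypothetical predictor intercorrelations
rxy <- c(.40, .35, .30)                 # hypothetical validities
beta    <- solve(R, rxy)                # optimal standardized weights (Equation 5.7)
R2.opt  <- sum(beta * rxy)
w       <- rep(1, 3)                    # "standardize and add"
R2.unit <- drop((sum(w * rxy))^2 / (t(w) %*% R %*% w))
round(c(R2.optimal = R2.opt, R2.unit = R2.unit), 3)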

5.3 Partial and semi-partial correlation

Given three or more variables, an interesting question to ask is what is the relationship
between xi and y when the effect of x j has been removed? In an experiment it is possible to
answer this by forcing xi and x j to be independent by design. Then it is possible to decompose
the variance of y in terms of effects of xi and x j and possibly their interaction. However, in
the correlational case, it is likely that xi and x j are correlated. A solution is to consider linear
regression to predict xi and y from x j and to correlate the residuals. That is, we know from
linear regression that it is possible to predict xi and y from x j . Then the correlation of the
residuals xi. = xi − x̂i and y. j = y − ŷ j is a measure of the strength of the relationship between
xi and y when the effect of x j has been removed. This is known as the partial correlation, for
it has partialed out the effects on both the xi and y of the other variables.
In the process of finding the appropriate weights in the multiple regression, the effect of
each variable xi on the criterion y was found with the effect of the other xj (j ≠ i) variables
removed. This was done explicitly in Equation 5.4 and implicitly in 5.7. The numerator
in 5.4 is a covariance with the effect of the second variable removed and the denominator
is a variance with the second variable removed. Just as in simple regression, where β is a
covariance divided by a variance and a correlation is a covariance divided by the square root
of the product of two variances, so it is in multiple correlation: the βi is a partial
covariance divided by a partial variance, and a partial correlation is a partial covariance
divided by the square root of the product of two partial variances. The partial correlation
between xi and y with the effect of xj removed is

$$r_{(x_i.x_j)(y.x_j)} = \frac{r_{x_iy} - r_{x_ix_j}r_{x_jy}}{\sqrt{(1-r_{x_ix_j}^2)(1-r_{yx_j}^2)}} \tag{5.9}$$

Compare this to 5.4, which is the formula for the β weight.


Given a data matrix, X, and a matrix of covariates, Z, with correlations Rxz with X and
correlations Rz with each other, the residuals, X*, will be

$$X^* = X - R_{xz}R_z^{-1}Z$$

To find the matrix of partial correlations, R*, where the effect of a number of the Z variables has
been removed, just express Equation 5.9 in matrix form. First find the residual covariances,
C*, and then divide these by the square roots of the residual variances (the diagonal elements
of C*):

$$C^* = R - R_{xz}R_z^{-1}R_{xz}'$$

$$R^* = \left(\sqrt{diag(C^*)}\right)^{-1} C^* \left(\sqrt{diag(C^*)}\right)^{-1} \tag{5.10}$$
Consider the correlation matrix of five variables seen in Table 5.3. The partial correlations of
the first three with the effect of the last two removed are found using the partial.r function.

Table 5.3 Using partial.r to find a matrix of partial correlations


> R.mat

V1 V2 V3 V4 V5
V1 1.00 0.56 0.48 0.40 0.32
V2 0.56 1.00 0.42 0.35 0.28
V3 0.48 0.42 1.00 0.30 0.24
V4 0.40 0.35 0.30 1.00 0.20
V5 0.32 0.28 0.24 0.20 1.00

> partial.r(R.mat,c(1:3),c(4:5)) #specify the matrix for input, and the columns for the X and Z variables

V1 V2 V3
V1 1.00 0.46 0.38
V2 0.46 1.00 0.32
V3 0.38 0.32 1.00
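Equation 5.10 can also be applied by hand; the following sketch (not from the original text) recomputes the residual covariances and correlations for the same matrix and should reproduce the partial.r output above:

R.mat <- matrix(c(1.00, .56, .48, .40, .32,
                  .56, 1.00, .42, .35, .28,
                  .48, .42, 1.00, .30, .24,
                  .40, .35, .30, 1.00, .20,
                  .32, .28, .24, .20, 1.00), 5, 5)
x <- 1:3                                # the X variables
z <- 4:5                                # the Z (covariate) variables
Cstar <- R.mat[x, x] - R.mat[x, z] %*% solve(R.mat[z, z]) %*% R.mat[z, x]  # residual covariances
d <- diag(1/sqrt(diag(Cstar)))
round(d %*% Cstar %*% d, 2)             # residual (partial) correlations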

The semi-partial correlation, also known as the part correlation, is the correlation between
xi and y removing the effect of the other xj from the predictor, xi, but not from the criterion,
y. It is just

$$r_{(x_i.x_j)(y)} = \frac{r_{x_iy} - r_{x_ix_j}r_{x_jy}}{\sqrt{1-r_{x_ix_j}^2}} \tag{5.11}$$


5.3.1 Alternative interpretations of the partial correlation

Partial correlations are used when arguing that the effect of xi on y either does or does not remain
when other variables, xj, are statistically “controlled”. That is, in Table 5.3, the correlation
between V1 and V2 is very high, even when the effects of V4 and V5 are removed. But this
interpretation requires that each variable is measured without error. An alternative model
that corrects for error of measurement (unreliability) would show that when the error free
parts of V4 and V5 are used as covariates, the partial correlation between V1 and V2 becomes
0. This issue will be discussed in much more detail when considering models of reliability as
well as factor analysis and structural equation modeling.

5.4 Alternative regression techniques

That the linear model can be used with categorical predictors has already been discussed.
Generalizations of the linear model to outcomes that are not normally distributed fall under
the class of the generalized linear model and can be found using the glm function. One of the
most common extensions is to the case of dichotomous outcomes (pass or fail, survive or die)
which may be predicted using logistic regression. Another generalization is to non-normally
distributed count data or rate data where either Poisson regression or negative binomial
regression are used. These models are solved by iterative maximum likelihood procedures
rather than ordinary least squares as used in the linear model.
The need for these generalizations is that the normal theory of the linear model is inap-
propriate for such dependent variables. (e.g., what is the meaning of a predicted probability
higher than 1 or less than 0?) The various generalizations of the linear model transform the
dependent variable in some way so as to make linear changes in the predictors lead to linear
changes in the transformed dependent variable. For more complete discussions of when to
apply the linear model versus generalizations of these models, consult Cohen et al. (2003) or
Gardner et al. (1995).

5.4.1 Logistic regression

Consider, for example, the case of a binary outcome variable. Because the observed values
can only be 0 or 1, it is necessary to predict the probability of the score rather than the
score itself. But even so, probabilities are bounded (0,1) so regression estimates less than
0 or greater than 1 are meaningless. A solution is to analyze not the data themselves, but
rather a monotonic transformation of the probabilities, the logistic function:
$$p(Y|X) = \frac{1}{1+e^{-(\beta_0 + \beta x)}}.$$

Using deviation scores, if the likelihood, p(y), of observing some binary outcome, y, is a
continuous function of a predictor set, X, where each column of X, xi, is related to the
outcome probability with a logistic function where β0 is the predicted intercept and βi is the
effect of xi,

$$p(y|x_1 \dots x_i \dots x_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n)}}$$

then the likelihood of not observing y, $p(\tilde{y})$, given the same predictor set is

$$p(\tilde{y}|x_1 \dots x_i \dots x_n) = 1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n)}} = \frac{e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n)}}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n)}}$$

and the odds ratio of observing y to not observing y is

$$\frac{p(y|x_1 \dots x_i \dots x_n)}{p(\tilde{y}|x_1 \dots x_i \dots x_n)} = \frac{1}{e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n)}} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n}.$$

Thus, the logarithm of the odds ratio (the log odds) is a linear function of the xi:

$$\ln(odds) = \beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \dots + \beta_n x_n = \beta_0 + \beta X \tag{5.12}$$

Consider the probability of being a college graduate given the predictors of age and several
measures of ability. The data set sat.act has a measure of education (0 = not yet finished
high school, ..., 5 = have a graduate degree). Converting this to a dichotomous score (education
> 3) to identify those who have finished college or not, and then predicting this variable by a
logistic regression using the glm function, shows that age is positively related to the probability
of being a college graduate (not an overly surprising result), as is a higher ACT (American
College Testing program) score. The results are expressed as changes in the logarithm of
the odds for unit changes in the predictors. Expressing these as odds ratios may be done by
taking the anti-log (i.e., the exponential) of the parameters. The confidence intervals of the
parameters or of the odds ratios may be found by using the confint function (Table 5.4).
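A minimal follow-up sketch (not from the original text) converts the fitted log odds from the model in Table 5.4 into predicted probabilities for two hypothetical cases, either with predict or by applying the logistic transformation to Equation 5.12:

new.cases <- data.frame(age = c(20, 40), ACT = c(28, 28))    # hypothetical respondents
predict(logistic.model, new.cases, type = "response")        # predicted probabilities
plogis(predict(logistic.model, new.cases))                   # the same, from the log odds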

5.4.2 Poisson regression, quasi-Poisson regression, and negative-binomial regression

If the underlying process is thought to be binary with a low probability of one of the two
alternatives (e.g., scoring a goal in a football tournament, speaking versus not speaking in
a classroom, becoming sick or not, missing school for a day, dying from being kicked by a
horse, a flying bomb hit in a particular area, a phone trunk line being in use, etc.) sampled
over a number of trials and the measure is the discrete counts (e.g., 0, 1, ... n= number of
responses) of the less likely alternative, one appropriate distributional model is the Poisson.
The Poisson is the limiting case of a binomial over N trials with probability p for small p.
For a random variable, Y, the probability that it takes on a particular value, y, is

$$p(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$$

where both the expectation (mean) and variance of Y are



Table 5.4 An example of logistic regression using the glm function. The resulting coefficients are the
parameters of the logistic model expressed in the logarithm of the odds. They may be converted to
odds ratios by taking the exponential of the parameters. The same may be done with the confidence
intervals of the parameters and of the odds ratios.

> data(sat.act)
> college <- (sat.act$education > 3) +0 #convert to a binary variable
> College <- data.frame(college,sat.act)
> logistic.model <- glm(college~age+ACT,family=binomial,data=College)
> summary(logistic.model)

Call:
glm(formula = college ~ age + ACT, family = binomial, data = College)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8501 -0.6105 -0.4584 0.5568 1.7715
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.78855 0.79969 -9.739 <2e-16 ***
age 0.23234 0.01912 12.149 <2e-16 ***
ACT 0.05590 0.02197 2.544 0.0109 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 941.40 on 699 degrees of freedom
Residual deviance: 615.32 on 697 degrees of freedom
AIC: 621.32
Number of Fisher Scoring iterations: 5

> round(exp(coef(logistic.model)),2)
> round(exp(confint(logistic.model)),digits=3)

(Intercept) age ACT


0.00 1.26 1.06
2.5 % 97.5 %
(Intercept) 0.000 0.002
age 1.217 1.312
ACT 1.014 1.105

E(Y ) = var(Y ) = λ (5.13)

and y factorial is
y! = y ∗ (y − 1) ∗ (y − 2)... ∗ 2 ∗ 1.
The sum of independent Poisson variables is itself distributed as a Poisson variable, so it is
possible to aggregate data across an independent grouping variable.
Poisson regression models the mean for Y by modeling λ as an exponential function of
the predictor set (xi )
$$E(Y) = \lambda = e^{\alpha + \beta_1 x_1 + \cdots + \beta_p x_p}$$
and the log of the mean will thus be a linear function of the predictors.
Several example data sets are available in R to demonstrate the advantages of Poisson
regression over simple linear regression. epil in MASS reports the number of epileptic seizures
before and after administration of an anti-seizure medication or a placebo as a function of
age and other covariates; quine (also in MASS) reports the rate of absenteeism in a small
town in Australia as a function of culture, age, sex, and learning ability.
In the Poisson model, the mean has the same value as the variance (Equation 5.13).
However, overdispersed data have larger variances than expected from the Poisson model.
Examples of such data include the number of violent episodes of psychiatric patients (Gardner
et al., 1995) or the number of seals hauled out on a beach (Ver Hoef and Boveng, 2007). Such data
should be modeled using the negative binomial or an overdispersed Poisson model (Gardner
et al., 1995). The over-dispersed Poisson model adds an additional parameter, γ, to the
Poisson variance model:

$$var(Y) = \gamma\lambda.$$
These generalizations of the linear model make use of the glm function with the error family and
link function appropriate for the particular model. Thus, logistic regression uses the
binomial family (Table 5.4) and Poisson regression uses the Poisson family with its log link (Table 5.5).
Negative binomial regression may be done using the glm.nb function from the MASS package.
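As a brief hedged sketch (not from the original text), the quine absenteeism data mentioned above could be fit with both a Poisson and a negative binomial model to see the effect of allowing for overdispersion:

library(MASS)
data(quine)
summary(glm(Days ~ Eth + Sex + Age + Lrn, family = poisson, data = quine))  # overdispersed fit
summary(glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine))                 # negative binomial fit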

5.4.3 Using multiple regression for circular data

Some variables show a cyclical pattern over periods of hours, days, months or years. In
psychology perhaps the best example is the diurnal rhythm of mood and energy. Energetic
arousal is typically low in the early morning, rises to a peak sometime between 12:00 and
16:00, and then decreases during the evening. Such rhythms can be described in terms of
their period and their phase. If the period of a rhythm is about 24 hours, it is said to be
circadian. The acrophase of a variable is that time of day when the variable reaches its
maximum. A great deal of research has shown that people differ in the time of day at which
they achieve their acrophase for variables ranging from cognitive performance (Revelle, 1993;
Revelle et al., 1980) to positive affect (Rafaeli et al., 2007; Thayer et al., 1988) (however, for
some measures, such as body temperature, the minimum is more precise measure of phase
than is the maximum (Baehr et al., 2000)). If we know the acrophase, we can use circular
statistics to find the mean and correlation of these variables with other circular variables
(3.4.1). The acrophase itself can be estimated using linear regression, not of the raw data
predicted by time of day, but rather by multiple regression using the sine and cosine of time
of day (expressed in radians).
Consider four different emotion variables: Energetic Arousal, Positive Affect, Tense
Arousal, and Negative Affect. Assume that all four of these variables show a diurnal rhythm,
but differ in their phase (Figure 5.3). Consider the example data set created in Table 5.6.
Four curves are created (top panel of Figure 5.3) with different phases, but then have error
added to them (lower panel of Figure 5.3). The cosinor function estimates the phase angle
by fitting each variable with multiple regression where the predictors are cos(time ∗ 2π/24)
and sin(time ∗ 2π/24). The resulting β weights are then transformed into phase angles (in
radians) by
$$\phi = \tan^{-1}\!\left(\frac{\beta_{sin}}{\beta_{cos}}\right).$$

Table 5.5 Using the general linear model glm to do Poisson regression for the effect of an anti-seizure
drug on epilepsy attacks. The data are from the epil data set in MASS. Compare this analysis with
a simple linear model or with a linear model of the log transformed data. Note that the effect of the
drug in the linear model is not statistically different from zero, but is in the Poisson regression.

> data(epil)
> summary(glm(y~trt+base,data=epil,family=poisson))

Call:
glm(formula = y ~ trt + base, family = poisson, data = epil)

Deviance Residuals:
Min 1Q Median 3Q Max
-4.6157 -1.5080 -0.4681 0.4374 12.4054

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.278079 0.040709 31.396 < 2e-16 ***
trtprogabide -0.223093 0.046309 -4.817 1.45e-06 ***
base 0.021754 0.000482 45.130 < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2517.83 on 235 degrees of freedom


Residual deviance: 987.27 on 233 degrees of freedom
AIC: 1759.2

Number of Fisher Scoring iterations: 5

> summary(lm(y~trt+base,data=epil))

lm(formula = y ~ trt + base, data = epil)

Residuals:
Min 1Q Median 3Q Max
-19.40019 -3.29228 0.02348 2.11521 58.88226

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.27396 0.96814 -2.349 0.0197 *
trtprogabide -0.91233 1.04514 -0.873 0.3836
base 0.35258 0.01958 18.003 <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.017 on 233 degrees of freedom


Multiple R-squared: 0.582, Adjusted R-squared: 0.5784
F-statistic: 162.2 on 2 and 233 DF, p-value: < 2.2e-16

This result may be transformed back to hours by $phase = \frac{\phi \cdot 24}{2\pi}$. Other packages that use
circular statistics are the circular and CircStats packages.
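The following sketch (not from the original text, and not the cosinor code itself) shows the same logic applied by hand to one hypothetical variable peaking near 8:00:

set.seed(1)
time <- 1:24
EA  <- cos((time - 8) * pi/12) + rnorm(24)/2        # hypothetical diurnal signal, acrophase 8
fit <- lm(EA ~ cos(time * pi/12) + sin(time * pi/12))
b   <- coef(fit)[-1]                                # the cosine and sine beta weights
phase.hours <- (atan2(b[2], b[1]) * 12/pi) %% 24    # phase angle converted to hours
round(phase.hours, 1)                               # should be close to 8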

Table 5.6 Many emotional variables show a diurnal rhythm. Here four variables are simulated in their
pure form, and then contaminated by noise. Phase is estimated by the cosinor function.

> set.seed(42)
> nt = 4
> time <- seq(1:24)
> pure <- matrix(time,24,nt)
> pure <- cos((pure + col(pure)*nt)*pi/12)
> diurnal <- data.frame(time,pure)
> noisy <- pure + rnorm(24*nt)/2
> circadian <- data.frame(time,noisy)
> colnames(circadian) <- colnames(diurnal) <- c("Time", "NegA","TA","PA","EA")
> p <- cosinor(diurnal)
> n <- cosinor(circadian)
> round(data.frame(phase=p[,1],estimate=n[,1],fit=n[,2]),2)

> matplot(pure,type="l",xlab = "Time of Day",ylab="Intensity",


main="Hypothetical emotional curves",xaxp=c(0,24,8))
> matplot(noisy,type="l",xlab = "Time of Day",ylab="Arousal",
main="Noisy arousal curves",xaxp=c(0,24,8))

phase estimate fit


NegA 20 20.59 0.61
TA 16 16.38 0.76
PA 12 12.38 0.81
EA 8 8.26 0.84

5.4.4 Robust regression using M estimators

Robust techniques estimate relationships while trying to correct for unusual data (outliers). A
number of packages include functions that apply robust techniques to estimate correlations,
covariances, and linear regressions. The MASS, robust, and robustbase packages all include
robust estimation procedures. Consider the stackloss data set in the MASS package. A
pairs.panels plot of the data suggests that three cases are extreme outliers. The robust
linear regression function rlm shows a somewhat different pattern of estimates than does
ordinary regression.
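A minimal sketch of this comparison (not from the original text) is simply:

library(MASS)
library(psych)
pairs.panels(stackloss)                      # look at the data first
lm(stack.loss ~ ., data = stackloss)         # ordinary least squares
rlm(stack.loss ~ ., data = stackloss)        # robust M estimation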
An interesting demonstration of the power of the human eye to estimate relationships was
presented by Wainer and Thissen (1979), who show that visual displays are an important
part of the data analytic enterprise. Students shown figures representing various pure cases
of correlation were able to estimate the underlying correlation of contaminated data better
than many of the more classic robust estimates. This is an important message: look at your
data! Do not be misled by simple (or even complex) summary statistics. The power of the
eye to detect outliers, non-linearity, and just general errors should not be underestimated.

[Figure 5.3 appears here: two panels, "Hypothetical arousal curves" and "Noisy arousal curves", plotting Arousal against Time of Day.]

Fig. 5.3 Some psychological variables have a diurnal rhythm. The phase of the rhythm may be
estimated using the cosinor function using multiple regression of the sine and cosine of the time of
day. The top panel shows four diurnal rhythms with acrophases of 8, 12, 16, and 20. The lower panel
plots the same data, but with random noise added to the signal. The corresponding phases estimated
using cosinor are 8.3, 12.4, 16.4 and 20.6.
Chapter 6
Constructs, Components, and Factor models

Parsimony of description has been a goal of science since at least the famous dictum commonly
attributed to William of Ockham to not multiply entities beyond necessity1 . The goal for
parsimony is seen in psychometrics as an attempt either to describe (components) or to
explain (factors) the relationships between many observed variables in terms of a more limited
set of components or latent factors.
The typical data matrix represents multiple items or scales usually thought to reflect fewer
underlying constructs². At its most simple, a set of items can be thought of as representing
random samples from one underlying domain or perhaps a small set of domains. The ques-
tion for the psychometrician is how many domains are represented and how well does each
item represent the domains. Solutions to this problem are examples of factor analysis (FA),
principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim
to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer
underlying constructs to explain the observed data. In the case of PCA, the goal can be mere
data reduction, but the interpretation of components is frequently done in terms similar to
those used when describing the latent variables estimated by FA. Cluster analytic techniques,
although usually used to partition the subject space rather than the variable space, can also
be used to group variables to reduce the complexity of the data by forming fewer and more
homogeneous sets of tests or items.
At the data level the data reduction problem may be solved as a Singular Value Decompo-
sition of the original matrix, although the more typical solution is to find either the principal
components or factors of the covariance or correlation matrices. Given the pattern of regres-
sion weights from the variables to the components or from the factors to the variables, it
is then possible to find (for components) individual component or cluster scores or estimate
(for factors) factor scores.
Consider the matrix X of n deviation scores for N subjects, where each element, xi j ,
represents the responses of the ith individual to the jth item or test. For simplicity, let the xi j
scores in each column be deviations from the mean for that column (i.e., they are column
centered, perhaps by using scale). Let the number of variables be n. Then the covariance
matrix, Cov, is

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918),
Ockham’s razor remains a fundamental principle of science.
2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more

factors than observed variables, but are willing to estimate the major underlying factors.


$$Cov = N^{-1}XX'$$

and the standard deviations are

$$sd = \sqrt{diag(Cov)}.$$

Let the matrix $I_{sd}$ be a diagonal matrix with elements $\frac{1}{sd_i}$; then the correlation matrix R is

$$R = I_{sd}\,Cov\,I_{sd}.$$
The problem is how to approximate the matrix R, of rank n, with a matrix of lower rank.
The solution to this problem may be seen if we think about how to create a model matrix
to approximate R.
Consider the correlation matrix R formed as the matrix product of a vector f (Table 6.1)3 :
By observation, except for the diagonal, R seems to be a multiplication table with the first

Table 6.1 Creating a correlation matrix from a factor model. In this case, the factor model is a single
vector f and the correlation matrix is created as the product ff′ with the additional constraint that
the diagonal of the resulting matrix is set to 1.

> f <- seq(.9,.4,-.1) #the model


> f
[1] 0.9 0.8 0.7 0.6 0.5 0.4

> R <- f %*% t(f) #create the correlation matrix


> diag(R) <- 1
> rownames(R) <- colnames(R) <- paste("V",seq(1:6),sep="")
> R

V1 V2 V3 V4 V5 V6
V1 1.00 0.72 0.63 0.54 0.45 0.36
V2 0.72 1.00 0.56 0.48 0.40 0.32
V3 0.63 0.56 1.00 0.42 0.35 0.28
V4 0.54 0.48 0.42 1.00 0.30 0.24
V5 0.45 0.40 0.35 0.30 1.00 0.20
V6 0.36 0.32 0.28 0.24 0.20 1.00

column representing the .9s, the second column the .8s, etc. Is it possible to represent this
matrix in a more parsimonious way? That is, can we determine the vector f that generated
the correlation matrix? (This is sometimes seen as the problem of unscrambling eggs.) There
are two broad answers to this question. The first is a model that approximates the correlation
matrix in terms of the product of components, where each component is a weighted linear sum
of the variables; the second model is also an approximation of the correlation matrix by the
product of two factors, but the factors in this case are seen as causes rather than as consequences
of the variables. That is,

$$R \approx CC' \tag{6.1}$$

3 Although the following discussion will be done in terms of correlation matrices, goodness of fit tests
are more typically done on covariance matrices. It is somewhat simpler to do the discussion in terms
of correlations.

where, if n is the number of variables in R, then the ith component, Ci, is a linear sum of the
variables:

$$C_i = \sum_{j=1}^{n} w_{ij}x_j. \tag{6.2}$$

The factor model appears to be very similar, but with the addition of a diagonal matrix
of uniquenesses (U²), and

$$R \approx FF' + U^2 \tag{6.3}$$

with the variables described as weighted linear sums of the (unknown) factors:

$$x_i \approx \sum_{j=1}^{n} w_{ij}F_j. \tag{6.4}$$

The weights in Equation 6.4, although similar, are not the same as those in Equation 6.2.
One way to think of the difference between the two models is in terms of path diagrams. For
the component model, the component is the linear sum of the known variables. This is shown
by the arrows from the variables to the component in the left hand panel of Figure 6.1. The
factor model, on the other hand, represents each variable as the linear sum of two parts, the
common factor and a unique factor for each variable. These effects are shown as arrows from
the factor to the variables and curved arrows from the variables to themselves. (The notation
represents observed variables with square boxes, unobservable or latent variables with circles
or ellipses. The coefficients are the weights to be found.) Solving for the components is a
straightforward exercise in linear algebra; solving for the factors is a bit more complicated.

6.1 Principal Components: an observed variable model

Principal components is a method that finds the basis space of a particular correlation or
covariance matrix. That is, it is a way of finding a matrix of orthogonal vectors that can
represent the original matrix. This may be done by finding the eigenvectors and eigenvalues
of the original matrix. The eigenvalues are also known as the characteristic roots of the
matrix, and the eigenvectors as its characteristic vectors.

6.1.1 Eigenvalues and Eigenvectors

As reviewed in Appendix E, given an n × n matrix R, each eigenvector, xi, solves the equation

$$x_iR = \lambda_i x_i$$

and the set of n eigenvectors are solutions to the equation

$$RX = X\lambda$$

where X is a matrix of orthogonal eigenvectors and λ is a diagonal matrix of the eigenvalues, λi. Then

[Figure 6.1 appears here: two path diagrams, labeled Component model and Factor model, relating variables x1 to x6 to a single component or factor through paths a to f.]

Fig. 6.1 Components are linear sums of variables and do not necessarily say anything about the
correlations between the variables. Factors are latent variables thought to explain the correlations or
covariances between observed variables. The fundamental factor equation expresses the correlations
between variables in terms of the factor loadings. The variance of each variable is made up of two
unobservable parts, that which is common to the variables (the factor) and that which is unique to each
variable. These unique components are shown as curved arrows back to the variable. Observed variables
are represented by boxes, unobserved variables by ellipses. Graph created using structure.graph.

$$x_iR - \lambda_i x_iI = 0 \iff x_i(R - \lambda_iI) = 0$$

Finding the eigenvectors and eigenvalues is computationally tedious, but may be done using
the eigen function which uses a QR decomposition of the matrix. That the vectors making
up X are orthogonal means that

$$XX' = I$$

and that

$$R = X\lambda X'.$$
That is, it is possible to recreate the correlation matrix R in terms of an orthogonal set of
vectors (the eigenvectors) scaled by their associated eigenvalues. Both the eigenvectors and
eigenvalues are found by using the eigen function.

Principal components are simply the eigenvectors scaled by the square root of the eigenvalues:

$$C = X\sqrt{\lambda}$$

and thus $R = CC'$.

Table 6.2 Eigenvalue decomposition of a matrix produces a vector of eigenvalues and a matrix of
eigenvectors. The product of the eigenvector with its transpose is the identity matrix. Defining a
component as the eigenvector matrix scaled by the squareroot of the eigenvalues leads to recreating
the matrix by the product of the component matrix times its transpose.

> e <- eigen(R) #eigenvalue decomposition


> print(e,digits=2) #show the solution

$values
[1] 3.16 0.82 0.72 0.59 0.44 0.26
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.50 0.061 0.092 0.14 0.238 0.816
[2,] -0.47 0.074 0.121 0.21 0.657 -0.533
[3,] -0.43 0.096 0.182 0.53 -0.675 -0.184
[4,] -0.39 0.142 0.414 -0.78 -0.201 -0.104
[5,] -0.34 0.299 -0.860 -0.20 -0.108 -0.067
[6,] -0.28 -0.934 -0.178 -0.10 -0.067 -0.045

round(e$vectors %*% t(e$vectors),2) #show that the eigenvectors are orthogonal

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 1 0 0 0 0 0
[2,] 0 1 0 0 0 0
[3,] 0 0 1 0 0 0
[4,] 0 0 0 1 0 0
[5,] 0 0 0 0 1 0
[6,] 0 0 0 0 0 1

> C <- e$vectors %*% diag(sqrt(e$values)) #components


> C %*% t(C) #reproduce the correlation matrix

[,1] [,2] [,3] [,4] [,5] [,6]


[1,] 1.00 0.72 0.63 0.54 0.45 0.36
[2,] 0.72 1.00 0.56 0.48 0.40 0.32
[3,] 0.63 0.56 1.00 0.42 0.35 0.28
[4,] 0.54 0.48 0.42 1.00 0.30 0.24
[5,] 0.45 0.40 0.35 0.30 1.00 0.20
[6,] 0.36 0.32 0.28 0.24 0.20 1.00

6.1.2 Principal components

Typically we do not want to exactly reproduce the original n ∗ n correlation matrix, for there
is no gain in parsimony (the rank of the C matrix is the same as the rank of the original R
matrix) but rather want to approximate it with a matrix of lower rank (k < n). This may be

done by using just the first k principal components. This requires selecting and rescaling the
first k components returned by the functions princomp and prcomp (Anderson, 1963). Alter-
natively, the principal function will provide the first k components scaled appropriately.
Consider just the first principal component of the matrix R (Table 6.3). The loadings
matrix shows the correlations of each variable with the component. The uniquenesses, a
concept from factor analysis, reflect the variance not explained for each variable. As is seen
in Table 6.3, just one component does not reproduce the matrix very well, for it overestimates
the correlations and underestimates the elements on the diagonal. The components solution,
in attempting to account for the entire matrix, underestimates the importance of the major
variables, and overestimates the importance of the least important variables. This is due to
the influence of the diagonal elements of the matrix which are also being fitted. This is most
clearly seen by examining the residual matrix of the difference between R and the model of
R which is the product of the first principal component with its transpose. Increasing the
number of components used will provide a progressively better approximation to the original
R matrix, but at a cost of a reduction in parsimony.
If the goal is simple and parsimonious description of a correlation or covariance matrix,
the first k principal components will do a better job than any other k-dimensional solution.
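As a sketch of the rescaling mentioned above (not from the original text), the output of princomp can be converted to component loadings comparable to those of principal; the signs of the loadings may be arbitrarily reversed:

f <- seq(.9, .4, -.1)
R <- f %*% t(f)
diag(R) <- 1                                  # the R matrix of Table 6.1
pc <- princomp(covmat = R)                    # eigen decomposition of R
loadings.1 <- pc$loadings[, 1] * pc$sdev[1]   # first eigenvector scaled by sqrt(eigenvalue)
round(loadings.1, 2)                          # compare with the PC1 loadings in Table 6.3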

6.2 Exploratory Factor Analysis: a latent variable model

Originally developed by Spearman (1904a) for the case of one common factor, and then later
generalized by Thurstone (1947) and others to the case of multiple factors, factor analysis
is probably the most frequently used and sometimes the most controversial psychometric
procedure. The factor model (Equation 6.3), although seemingly very similar to the com-
ponents model, is in fact very different. For rather than having components as linear sums
of variables, in the factor model the variables are themselves linear sums of the unknown
factors. That is, while components can be solved for by doing an eigenvalue or singular
value decomposition, factors are estimated as best fitting solutions (Eckart and Young, 1936;
Householder and Young, 1938), normally through iterative methods (Jöreskog, 1978; Lawley
and Maxwell, 1963). Cattell (1965) referred to components analysis as a closed model and
factor analysis as an open model, in that by explaining just the common variance, there was
still more variance to explain.
Why is factor analysis controversial? At the structural level (i.e., the covariance or corre-
lation matrix), there are normally more observed variables than parameters to estimate and
the procedure is merely one of finding the best fitting solution using ordinary least squares,
weighted least squares, or maximum likelihood (Jöreskog, 1978; Lawley and Maxwell, 1963).
But at the data level, as Schonemann and Steiger (1978); Schonemann (1990); Steiger (1990)
have shown, the model is indeterminate, although scores can be estimated. Velicer and Jack-
son (1990b,a) provide an extensive discussion of the differences between the two models and
argue for the utility of PCA. Commentaries defending FA emphasize the benefits of FA for
theory construction and evaluation (Loehlin, 1990; McArdle, 1990; Preacher and MacCallum,
2003).
The factor model partitions the correlation or covariance matrix into that which is pre-
dictable by common factors, FF′, and that which is unique, U² (the diagonal matrix of
uniquenesses). Within R there are several main algorithms that are used to estimate the
factor coefficients. Maximum likelihood estimation may be done using the factanal function, or the

Table 6.3 First principal component of the R matrix. The solution, although capturing the rank order
of how important the variables are, underestimates the highest ones, overestimates the lowest ones,
and fails to account for the diagonal of the matrix. This is shown most clearly by considering the matrix of
residuals (R - CC′, the data - the model). Compare these residuals to those found using principal axes
factor analysis as seen in Table 6.4. The output of principal is an object of type psych to allow for
comparisons with factor analysis output from the factor analysis function, fa. The uniquenesses are
diag(R − CC′) and the loadings are correlations of the components with the variables. It is important
to note that the recovered loadings do not match the factor loadings that were used to generate the
data (Table 6.1).

> pc1 <- principal(R,1)


> pc1

Uniquenesses:
V1 V2 V3 V4 V5 V6
0.220 0.307 0.408 0.519 0.635 0.748
Loadings:
PC1
V1 0.88
V2 0.83
V3 0.77
V4 0.69
V5 0.60
V6 0.50
PC1
SS loadings 3.142
Proportion Var 0.524

> round(pc1$loadings %*% t(pc1$loadings),2) #show the model

V1 V2 V3 V4 V5 V6
V1 0.77 0.73 0.68 0.61 0.53 0.44
V2 0.73 0.69 0.64 0.57 0.50 0.42
V3 0.68 0.64 0.59 0.53 0.46 0.38
V4 0.61 0.57 0.53 0.48 0.41 0.34
V5 0.53 0.50 0.46 0.41 0.36 0.30
V6 0.44 0.42 0.38 0.34 0.30 0.25

> Rresid <- R - pc1$loadings %*% t(pc1$loadings) #find the residuals


> round(Rresid,2)

V1 V2 V3 V4 V5 V6
V1 0.23 -0.01 -0.05 -0.07 -0.08 -0.08
V2 -0.01 0.31 -0.08 -0.09 -0.10 -0.09
V3 -0.05 -0.08 0.41 -0.11 -0.11 -0.10
V4 -0.07 -0.09 -0.11 0.52 -0.11 -0.10
V5 -0.08 -0.10 -0.11 -0.11 0.64 -0.10
V6 -0.08 -0.09 -0.10 -0.10 -0.10 0.75

fa function in the psych package. fa will also provide a principal axes, minimum residual,
weighted least squares, or generalized least squares solution. Several other algorithms are available
in packages such as FAiR, and even more are possible but have not yet been implemented
in R.

6.2.1 Principal Axes Factor Analysis as an eigenvalue decomposition of a reduced matrix

Principal components represents an n × n matrix in terms of the first k components. It attempts
to reproduce all of the R matrix. Factor analysis, on the other hand, attempts to model just
the common part of the matrix, which means all of the off-diagonal elements and the common
part of the diagonal (the communalities). The non-common part, the uniquenesses, are simply
that which is left over. An easy to understand procedure is principal axes factor analysis.
This is similar to principal components, except that it is done with a reduced matrix where
the diagonals are the communalities. The communalities can either be specified a priori,
estimated by such procedures as multiple linear regression, or found by iteratively doing an
eigenvalue decomposition and repeatedly replacing the original 1s on the diagonal with the
value of 1 − u² where

$$U^2 = diag(R - FF').$$
That is, starting with the original correlation or covariance matrix, R, find the k largest
principal components and reproduce the matrix using those principal components. Find the
resulting residual matrix, R*, and uniqueness matrix, U², by

$$R^* = R - FF' \tag{6.5}$$

$$U^2 = diag(R^*)$$
and then, for iteration i, find Ri by replacing the diagonal of the original R matrix with 1 -
diag(U2 ) found on the previous step. Repeat this process until the change from one iteration
to the next is arbitrarily small. This procedure is implemented in the fa and factor.pa
functions in the psych package. (See Table 6.4 for an example of how well a single factor
can reproduce the original correlation matrix. The eggs have been unscrambled.) It is a
useful exercise to run fa using the principal axis factor method (fm= “pa”) and specifying
the number of iterations (e.g., max.iter=2). Then examine the size of the residuals as the
number of iterations increases. When this is done, the solution gets progressively better, in
that the size of the residuals in the off diagonal matrix become progressively smaller.
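A sketch of that exercise (not from the original text) might loop over a few iteration counts and track the largest off-diagonal residual; the residuals are computed exactly as in Table 6.4:

library(psych)
f <- seq(.9, .4, -.1)
R <- f %*% t(f)
diag(R) <- 1
colnames(R) <- rownames(R) <- paste0("V", 1:6)
for (i in c(1, 2, 5, 20)) {
  fi <- fa(R, nfactors = 1, fm = "pa", max.iter = i)
  resid <- R - fi$loadings %*% t(fi$loadings)      # data - model
  diag(resid) <- 0                                 # look only at the off diagonal
  cat(i, "iterations: largest residual =", round(max(abs(resid)), 4), "\n")
}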
Rather than starting with initial communality estimates of 1, the process can be started
with other estimates of the communality. A conventional starting point is the lower bound
estimate of the communalities, the squared multiple correlation or SMC (Roff, 1936). The
concept here is that a variable’s communality must be at least as great as the amount of its
variance that can be predicted by all of the other variables. A simple proof that the SMC is
a lower bound for the communality was given by Harris (1978). A somewhat greater lower
bound has been proposed by Yanai and Ichikawa (1990). The squared multiple correlations
of each variable with the remaining variables are the diagonal elements of

$$I - (diag(R^{-1}))^{-1}$$

and thus a starting estimate for R0 would be $R - (diag(R^{-1}))^{-1}$.


It is interesting to recognize that the obverse of this relationship holds and the communality
of a variable is the upper bound of the squared multiple correlation with that variable. To
increase the predictability of a variable, x, from a set of other variables, Y, it does not help
to add variables to the Y set that load on factors already measured, for this will not change
the communality. It is better to find variables that correlate with x but measure factors
not already included in Y (Guttman, 1940). Tryon (1957) provides a detailed discussion
of multiple ways to estimate a variable’s communality, with particular reference to domain
sampling models of reliability. For most problems, the initial communality estimate does not
have an effect upon the final solution, but in some cases, adjusting the start values to be 1
rather than the SMC will lead to better solutions when doing principal axes decompositions.
At least three indices of goodness of fit of the principal factors model can be considered.
The first compares the sum of squared residuals to the sum of the squares of the original values:

$$GF_{total} = 1 - \frac{1R^{*2}1'}{1R^{2}1'}.$$

The second does the same, but does not consider the diagonal of R:

$$GF_{off\ diagonal} = 1 - \frac{\sum_{i \neq j} r^{*2}_{ij}}{\sum_{i \neq j} r^{2}_{ij}} = 1 - \frac{1R^{*2}1' - tr(R^{*2})}{1R^{2}1' - tr(R^{2})}.$$

That is, if the sum is taken over all i ≠ j, then this evaluates how well the factor model fits
the off diagonals; if the sum includes the diagonal elements as well, then the GF statistic
evaluates the overall fit of the model to the data. Finally, a χ² test of the size of the residuals
simply sums all the squared residuals and multiplies by the number of observations:

$$\chi^2 = \sum_{i<j} r^{*2}_{ij}(N-1)$$

with p(p − 1)/2 degrees of freedom.

6.2.2 Maximum Likelihood Factor Analysis and its alternatives

The fundamental factor equation (Equation 6.3) may be viewed as a set of simultaneous equations
which may be solved several different ways: ordinary least squares, generalized least
squares, and maximum likelihood. Ordinary least squares (OLS) or unweighted least squares
(ULS) minimizes the sum of the squared residuals when modeling the sample correlation or
covariance matrix, S, with Σ = FF′ + U²:

$$E = \frac{1}{2}tr(S - \Sigma)^2 \tag{6.6}$$
where the trace, tr, of a matrix is the sum of the diagonal elements and the division by two
reflects the symmetry of the S matrix. As discussed by Jöreskog (1978) and Loehlin (2004),
Equation 6.6 can be generalized to weight the residuals (S − Σ ) by the inverse of the sample
matrix, S, and thus to minimize

Table 6.4 Principal Axis Factor Analysis with one factor. The solution was iterated until the com-
munality estimates (1- the uniquenesses) failed to change from one iteration to the next. Notice how
the off diagonal of the residual matrix is effectively zero. Compare this with the solution of principal
components seen in Table 6.3.

> f1 <- factor.pa(R)


> f1

Factor Analysis using method = pa


Call: factor.pa(r = R)
V PA1 h2 u2
1 1 0.9 0.81 0.19
2 2 0.8 0.64 0.36
3 3 0.7 0.49 0.51
4 4 0.6 0.36 0.64
5 5 0.5 0.25 0.75
6 6 0.4 0.16 0.84

PA1
SS loadings 2.71
Proportion Var 0.45

Test of the hypothesis that 1 factor is sufficient.

The degrees of freedom for the model is 9 and the fit was 0

Measures of factor score adequacy PA1


Correlation of scores with factors 0.94
Multiple R square of scores with factors 0.89
Minimum correlation of factor score estimates 0.78
Validity of unit weighted factor scores 0.91

> f1$loadings %*% t(f1$loadings) #show the model

V1 V2 V3 V4 V5 V6
V1 0.81 0.72 0.63 0.54 0.45 0.36
V2 0.72 0.64 0.56 0.48 0.40 0.32
V3 0.63 0.56 0.49 0.42 0.35 0.28
V4 0.54 0.48 0.42 0.36 0.30 0.24
V5 0.45 0.40 0.35 0.30 0.25 0.20
V6 0.36 0.32 0.28 0.24 0.20 0.16

> Rresid <- R - f1$loadings %*% t(f1$loadings) #show the residuals


> round(Rresid,2)

V1 V2 V3 V4 V5 V6
V1 0.19 0.00 0.00 0.00 0.00 0.00
V2 0.00 0.36 0.00 0.00 0.00 0.00
V3 0.00 0.00 0.51 0.00 0.00 0.00
V4 0.00 0.00 0.00 0.64 0.00 0.00
V5 0.00 0.00 0.00 0.00 0.75 0.00
V6 0.00 0.00 0.00 0.00 0.00 0.84

$$E = \frac{1}{2}tr((S - \Sigma)S^{-1})^2 = \frac{1}{2}tr(I - \Sigma S^{-1})^2. \tag{6.7}$$

This is known as generalized least squares (GLS) or weighted least squares (WLS). Similarly,
if the residuals are weighted by the inverse of the model, Σ, minimizing

$$E = \frac{1}{2}tr((S - \Sigma)\Sigma^{-1})^2 = \frac{1}{2}tr(S\Sigma^{-1} - I)^2 \tag{6.8}$$

will result in a model that maximizes the likelihood of the data. This procedure, maximum
likelihood estimation (MLE), is also seen as finding the minimum of

$$E = \frac{1}{2}\left[tr(\Sigma^{-1}S) - \ln\left|\Sigma^{-1}S\right| - p\right] \tag{6.9}$$
where p is the number of variables (Jöreskog, 1978). Perhaps a helpful intuitive explanation
of Equation 6.9 is that if the model is correct, then Σ = S and thus Σ −1 S = I. The trace of
an identity matrix of rank p is p, and the logarithm of |I| is 0. Thus, the value of E if the
model has perfect fit is 0. With the assumption of multivariate normality of the residuals,
and for large samples, a χ 2 statistic can be estimated for a model with p variables and f
factors (Bartlett, 1951; Jöreskog, 1978; Lawley and Maxwell, 1962):
$$\chi^2 = \left[tr(\Sigma^{-1}S) - \ln\left|\Sigma^{-1}S\right| - p\right]\left(N - 1 - (2p+5)/6 - (2f)/3\right). \tag{6.10}$$

This χ² has degrees of freedom

$$df = p(p-1)/2 - pf + f(f-1)/2, \tag{6.11}$$

that is, the number of lower off-diagonal correlations minus the number of unconstrained loadings
(Lawley and Maxwell, 1962).

6.2.2.1 Minimal Residual Factor Analysis

All of the previous factor analysis procedures attempt to optimize the fit of the model matrix
(Σ) to the correlation or covariance matrix (S). The diagonal of the matrix is treated as a
mixture of common variance and unique variance, and the problem becomes one of estimating
the common variance (the communality of each variable). An alternative is to ignore the
diagonal and to find the model which minimizes the squared residuals of the off diagonal
elements (Eckart and Young, 1936; Harman and Jones, 1966). This is done in the fa function
using the “minres” (default) option by finding the solution that minimizes

$$\frac{1}{2}\,1\left((S - I) - (\Sigma - diag(\Sigma))\right)^2 1'. \tag{6.12}$$

The advantage of the minres solution is that it does not require finding the inverse of either
the original correlation matrix (as do GLS and WLS) or the model matrix (as does
MLE), and thus it can be performed on non-positive definite matrices or matrices that are not
invertible.

6.2.2.2 Factor analysis by using the optim function

Factor analysis is one example of an iterative search to minimize a particular criterion. The
easiest way of doing this is to use the optim function which will find a vector of values that
minimize or maximize a particular criterion. If the function has an analytic derivative, the
optimization will be much faster. The fa function in psych uses optim to do factor analysis
optimizing most of the various criteria discussed above (all except principal axis which is done
by direct iteration). Conceptually, all of these criteria find the values of the communalities
that produce an eigen value decomposition that minimizes a particular definition of error.
That is, the fitting procedure for minimum residual factor analysis minimizes the squared
off diagonal elements of the residual matrix, ordinary least squares minimizes the error in
Equation 6.6, weighted least squares minimizes the error in Equation 6.7, and maximum
likelihood minimizes the error in Equations 6.8 or 6.9. The minimization function in fa is
taken (almost directly) from the fitting function used in factanal and generalized for the
OLS and GLS methods.
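To make the idea concrete, the following is a minimal one factor minres sketch (not the psych or factanal code): optim searches for the vector of loadings that minimizes the sum of squared off diagonal residuals, starting from the first principal component.

minres1 <- function(R) {
  fit <- function(lambda) {                      # the minres criterion
    resid <- R - tcrossprod(lambda)              # residual matrix
    sum(resid[lower.tri(resid)]^2)               # ignore the diagonal
  }
  e <- eigen(R)
  start <- e$vectors[, 1] * sqrt(e$values[1])    # first principal component as start values
  optim(start, fit, method = "BFGS")$par         # estimated loadings
}
round(minres1(R), 2)   # for the one factor R of Table 6.1 this should recover loadings near .9 ... .4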

6.2.3 Comparing extraction techniques

Assuming that the residual variance reflects normally distributed random error, the most
elegant statistical solution is that of maximum likelihood (Lawley and Maxwell, 1963). If
the residuals are not normally distributed, for instance if they represent many individual
nuisance factors (MacCallum and Tucker, 1991), then maximum likelihood techniques do
not recreate the major factor pattern nearly as well as ordinary least squares techniques
(MacCallum et al., 2007). The MLE technique is also not recommended when it will not converge or when the data are particularly noisy. Many demonstrations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. This is the structure assumed when doing ML factor analysis. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
However, in the normal case, factor analysis using maximum likelihood in either the factanal or the fa function is probably most appropriate. In particular, factanal can use multiple start
values and iterates to solutions that avoid Heywood (1931) cases of negative uniquenesses.
An example of a one factor solution is shown in Table 6.5. For the data in Table 6.1 it is
necessary to specify that the input is from a covariance or correlation matrix rather than from
a raw data matrix. The number of factors to extract is specified at 1. Table 6.5 shows the
factor loadings, uniquenesses, and goodness of fit. The number of observations was arbitrarily
set at 100 for this demonstration. The loadings are the correlations of the tests or items with
the factors. The sum of the squared loadings for the factor is the eigenvalue of the factor and
is the sum across all the items of the amount of variance accounted for by the factor. The
uniquenesses are u_i = 1 - h_i^2, where the communalities, h_i^2, are the elements of the diagonal of FF'. The \chi^2 measure of fit is perfect (because the correlation matrix was in fact generated
from the factor matrix with no error). The probability is 1 because the fit is perfect.
The minimum residual and principal axes models do not require finding the inverse of
either the original correlation matrix, R, or of the model matrix, FF� . As a consequence they

Table 6.5 One factor maximum likelihood solution to the correlation matrix created in Table 6.1
using maximum likelihood estimation by the fa function. The number of observations is set arbitrarily
to 100 to demonstrate how to run the function.

> fm1 <- fa(R,nfactors=1,n.obs=100,fm="ml")


> fm1

Factor Analysis using method = ml


Call: fa(r = R, nfactors = 1, n.obs = 100, fm = "ml")
V MR1 h2 u2
1 1 0.9 0.81 0.19
2 2 0.8 0.64 0.36
3 3 0.7 0.49 0.51
4 4 0.6 0.36 0.64
5 5 0.5 0.25 0.75
6 6 0.4 0.16 0.84

MR1
SS loadings 2.71
Proportion Var 0.45

Test of the hypothesis that 1 factor is sufficient.

The degrees of freedom for the model is 9 and the fit was 0
The number of observations was 100 with Chi Square = 0 with prob < 1

Measures of factor score adequacy MR1


Correlation of scores with factors 0.94
Multiple R square of scores with factors 0.89
Minimum correlation of factor score estimates 0.78
Validity of unit weighted factor scores 0.91

will produce solutions for singular matrices where maximum likelihood, generalized least squares, and weighted least squares will not. The Harman.Burt correlation matrix of eight
emotional variables in the Harman data set is an example of where a minres solution will
work but a maximum likelihood one will not.
The principal axes algorithm in the psych package iteratively estimates the communalities by repeatedly doing an eigenvalue decomposition of the original correlation matrix in which the diagonal values are replaced by the communality estimates from the previous iteration. When these values fail to change, the algorithm terminates. This differs from the principal axes method discussed by Harman (1976), which starts with an initial estimate of the communalities and does not iterate. The iterative solution matches what Harman (1976) refers to as a minres solution and what SPSS calls a ULS solution. The minres solution in the fa function seems to more closely approximate an MLE solution.
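A bare-bones version of this iteration is sketched below (assuming R is invertible so that squared multiple correlations can serve as starting communalities); it is meant to show the logic, not to reproduce the fa code.

pa.iterate <- function(R, nf = 1, max.iter = 50, tol = 1e-6) {
  h2 <- 1 - 1/diag(solve(R))                 # start with squared multiple correlations
  for (i in 1:max.iter) {
    Rc <- R
    diag(Rc) <- h2                           # the "reduced" correlation matrix
    e <- eigen(Rc)
    L <- e$vectors[, 1:nf, drop = FALSE] %*% diag(sqrt(pmax(e$values[1:nf], 0)), nf)
    new.h2 <- rowSums(L^2)                   # updated communalities
    if (max(abs(new.h2 - h2)) < tol) break
    h2 <- new.h2
  }
  L                                          # the principal axes loadings
}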

6.2.4 Exploratory analysis with more than one factor/component

Originally developed by Spearman (1904a) to examine one factor of ability, factor analysis is
now routinely used for the case of more than one underlying factor. Consider the correlation

matrix in Table 6.6 generated by a two factor model. The correlation matrix was created
by matrix multiplication of a factor matrix, F with its transpose, R = FF� and then labels
were added using the rownames function. A much simpler way would have been to use the
sim function. Although not shown, fitting one factor/component models to this correlation
matrix provides a very poor solution, and fitting a two component model (Table 6.6) once
again tends to overestimate the effects of the weaker variables. The residuals are found by
calculation (R* = R - CC'), but could have been found by requesting the residuals from the output of principal (Rresid <- pc2$residual). Fitting a two factor model estimated by fa
correctly recreates the pattern matrix (Table 6.7).

6.2.5 Comparing factors and components- part 1

Although on the surface the component model and the factor model appear to be very similar
(compare Tables 6.6 and 6.7), they are logically very different. In the components model,
components are linear sums of the variables. In the factor model, on the other hand, factors
are latent variables whose weighted sum accounts for the common part of the observed
variables. In path analytic terms, for the component model, arrows go from the variables
to the components, while in the factor model they go from the factors to the variables
(Figure 6.1). Logically this implies that the addition of one or more variables into a factor
space should not change the factors, but should change the components.
An example of this is when two additional variables are added to the correlation matrix
(Table 6.8). The factor pattern does not change for the previous variables, but the component
pattern does. Why is this? Because the components are aimed at accounting for all of the
variance of the matrix, adding new variables increases the amount of variance to be explained
and changes the previous estimates. But the common part of the variables (that which is
estimated by factors) is not sensitive to the presence (or absence) of other variables. This
addition of new variables should make a larger difference when there are only a few variables
representing a factor.
The difference between components and factors may also be seen by examining the effect
of the number of markers per factor/component when loadings are plotted as a function
of the number of variables defining a factor/component. For five uni-dimensional correlation

matrices with correlations ranging from .05 to .55 (and thus factor loadings ranging from √.05 to √.55) and the number of variables ranging from 2 to 20, the factor loadings in all cases are
exactly the square root of the between item correlations. But the component loadings are a
function of the correlations (as they should be) and of the number of variables. The component
loadings asymptotically tend towards the corresponding factor loadings (Figure 6.2). With
fewer than five to ten variables per component, and particularly for low communality items,
the inflation in the component loadings is substantial. Although a fundamental difference
between the two models, this problem of the additional variable is most obvious when there
are not very many markers for each factor and becomes less of an empirical problem as the
number of variables per component increases (Widaman, 1993).
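This inflation is easy to demonstrate. For an equicorrelation matrix with correlation r, the factor loading is √r regardless of the number of variables, while the loading on the first principal component is √((1 + (nvar - 1)r)/nvar), which approaches √r only as the number of variables grows. A small illustrative loop (not taken from the text):

r <- .25                                         # implies a factor loading of sqrt(.25) = .5
for (nvar in c(3, 6, 12, 24)) {
  Req <- matrix(r, nvar, nvar)                   # a local example equicorrelation matrix
  diag(Req) <- 1
  pc1 <- sqrt(eigen(Req)$values[1]/nvar)         # loading of every variable on the first component
  cat(nvar, "variables: component loading =", round(pc1, 2),
      " factor loading =", sqrt(r), "\n")
}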

Table 6.6 A somewhat more complicated model than the one factor example considered above has two factors. The analysis examines the first two principal components. Compare this with the analysis of the first two factors
using minimum residual factor analysis (Table 6.7). Both solutions are tested by trying to recreate the
original matrix. Examine the size of the residual matrix, Rresid. A simpler way to construct the data
matrix would have been to use the sim function.

>F <- matrix(c(.9,.8,.7,rep(0,6),.8,.7,.6),ncol=2) #the model


> rownames(F) <- paste("V",seq(1:6),sep="") #add labels
> colnames(F) <- c("F1", "F2")
> R <- F %*% t(F) #create the correlation matrix
> diag(R) <- 1 #adjust the diagonal of the matrix
> R

V1 V2 V3 V4 V5 V6
V1 1.00 0.72 0.63 0.00 0.00 0.00
V2 0.72 1.00 0.56 0.00 0.00 0.00
V3 0.63 0.56 1.00 0.00 0.00 0.00
V4 0.00 0.00 0.00 1.00 0.56 0.48
V5 0.00 0.00 0.00 0.56 1.00 0.42
V6 0.00 0.00 0.00 0.48 0.42 1.00

> pc2 <- principal(R,2)


> pc2

Uniquenesses:
V1 V2 V3 V4 V5 V6
0.182 0.234 0.309 0.282 0.332 0.409
Loadings:
PC1 PC2
V1 0.90
V2 0.88
V3 0.83
V4 0.85
V5 0.82
V6 0.77
PC1 PC2
SS loadings 2.273 1.988
Proportion Var 0.379 0.331
Cumulative Var 0.379 0.710

> round(pc2$loadings %*% t(pc2$loadings),2)

V1 V2 V3 V4 V5 V6
V1 0.81 0.79 0.75 0.00 0.00 0.00
V2 0.79 0.77 0.73 0.00 0.00 0.00
V3 0.75 0.73 0.69 0.00 0.00 0.00
V4 0.00 0.00 0.00 0.72 0.70 0.65
V5 0.00 0.00 0.00 0.70 0.67 0.63
V6 0.00 0.00 0.00 0.65 0.63 0.59

> Rresid <- R - pc2$loadings %*% t(pc2$loadings)


> round(Rresid,2)

V1 V2 V3 V4 V5 V6
V1 0.19 -0.07 -0.12 0.00 0.00 0.00
V2 -0.07 0.23 -0.17 0.00 0.00 0.00
V3 -0.12 -0.17 0.31 0.00 0.00 0.00
V4 0.00 0.00 0.00 0.28 -0.14 -0.17
V5 0.00 0.00 0.00 -0.14 0.33 -0.21
V6 0.00 0.00 0.00 -0.17 -0.21 0.41

Table 6.7 Comparing a two factor with the two component solution of Table 6.6. This analysis
examines the first two factors. Compare this with the analysis of the first two principal components (Table 6.6). Both solutions are tested by trying to recreate the original matrix. As should be the case,
the diagonal residuals are larger, but the off diagonal residuals are smaller for the factor solution.

> f2 <- fa(R,2)


> f2
> round(f2$loadings %*% t(f2$loadings),2)

Factor Analysis using method = minres


Call: fa(r = R, nfactors = 2)
V MR1 MR2 h2 u2
V1 1 0.9 0.81 0.19
V2 2 0.8 0.64 0.36
V3 3 0.7 0.49 0.51
V4 4 0.8 0.64 0.36
V5 5 0.7 0.49 0.51
V6 6 0.6 0.36 0.64

MR1 MR2
SS loadings 1.94 1.49
Proportion Var 0.32 0.25
Cumulative Var 0.32 0.57

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the model is 4 and the fit was 0

Measures of factor score adequacy MR1 MR2


Correlation of scores with factors 0.94 0.88
Multiple R square of scores with factors 0.88 0.77
Minimum correlation of factor score estimates 0.75 0.53
Validity of unit weighted factor scores 0.92 0.86

V1 V2 V3 V4 V5 V6
V1 0.81 0.72 0.63 0.00 0.00 0.00
V2 0.72 0.64 0.56 0.00 0.00 0.00
V3 0.63 0.56 0.49 0.00 0.00 0.00
V4 0.00 0.00 0.00 0.64 0.56 0.48
V5 0.00 0.00 0.00 0.56 0.49 0.42
V6 0.00 0.00 0.00 0.48 0.42 0.36

> round(f2$residual,2)

V1 V2 V3 V4 V5 V6
V1 0.19 0.00 0.00 0.00 0.00 0.00
V2 0.00 0.36 0.00 0.00 0.00 0.00
V3 0.00 0.00 0.51 0.00 0.00 0.00
V4 0.00 0.00 0.00 0.36 0.00 0.00
V5 0.00 0.00 0.00 0.00 0.51 0.00
V6 0.00 0.00 0.00 0.00 0.00 0.64
[Figure 6.2 appears here: "Component loadings vary by number of variables", plotting Component Loading (y axis, 0 to 1.0) against the Number of Variables (x axis) for between-item correlations ranging from .05 (bottom curve) to .55 (top curve).]

Fig. 6.2 Component loadings are a function both of the item correlations and the number of variables
defining the component. The curved lines represent the component loadings for matrices with correla-
tions ranging from .05 (bottom line) to .55 (top line). The asymptotic value of the loadings is the factor
loadings or the square root of the corresponding correlation (shown on the right hand side). With fewer
than five to ten variables per component, the inflation in the component loadings is substantial.

6.3 Rotations and Transformations

The original solution of a principal components or principal axes factor analysis is a set of
vectors that best account for the observed covariance or correlation matrix, and where the
components or factors account for progressively less and less variance. But such a solution,
although maximally efficient in describing the data, is rarely easy to interpret. What
makes a structure easy to interpret? Thurstone’s answer, simple structure, consists of five
rules (Thurstone, 1947, p 335):
(1) Each row of the oblique factor matrix V should have at least one zero.
(2) For each column p of the factor matrix V there should be a distinct set of r linearly inde-
pendent tests whose factor loadings vip are zero.
(3) For every pair of columns of V there should be several tests whose entries vip vanish in one
column but not in the other.

Table 6.8 Comparing factors and components when an additional variable is added. Comparing these
results to Table 6.7, it is important to note that the loadings of the old items remain the same.

> f <- matrix(c(.9,.8,.7,rep(0,3),.7,rep(0,4),.8,.7,.6,0,.5),ncol=2) #the model


> rownames(f) <- paste("V",seq(1:8),sep="") #add labels
> colnames(f) <- c("F1", "F2")
> R <- f %*% t(f) #create the correlation matrix
> diag(R) <- 1 #adjust the diagonal of the matrix
> R

V1 V2 V3 V4 V5 V6 V7 V8
V1 1.00 0.72 0.63 0.00 0.00 0.00 0.63 0.00
V2 0.72 1.00 0.56 0.00 0.00 0.00 0.56 0.00
V3 0.63 0.56 1.00 0.00 0.00 0.00 0.49 0.00
V4 0.00 0.00 0.00 1.00 0.56 0.48 0.00 0.40
V5 0.00 0.00 0.00 0.56 1.00 0.42 0.00 0.35
V6 0.00 0.00 0.00 0.48 0.42 1.00 0.00 0.30
V7 0.63 0.56 0.49 0.00 0.00 0.00 1.00 0.00
V8 0.00 0.00 0.00 0.40 0.35 0.30 0.00 1.00

> f2 <- factanal(covmat=R,factors=2)


> f2

Call:
factanal(factors = 2, covmat = R)
Uniquenesses:
V1 V2 V3 V4 V5 V6 V7 V8
0.19 0.36 0.51 0.36 0.51 0.64 0.51 0.75
Loadings:
Factor1 Factor2
V1 0.9
V2 0.8
V3 0.7
V4 0.8
V5 0.7
V6 0.6
V7 0.7
V8 0.5
Factor1 Factor2
SS loadings 2.430 1.740
Proportion Var 0.304 0.218
Cumulative Var 0.304 0.521
The degrees of freedom for the model is 13 and the fit was 0

(4) For every pair of columns of V, a large proportion of the tests should have zero entries in
both columns. This applies to factor problems with four or five or more common factors.
(5) For every pair of columns there should preferably be only a small number of tests with
non-vanishing entries in both columns.

Thurstone proposed to rotate the original solution to achieve simple structure.


A matrix is said to be rotated if it is multiplied by a matrix of orthogonal vectors that
preserves the communalities of each variable. Just as the original matrix was orthogonal, so
is the rotated solution. For two factors, the rotation matrix T will rotate the two factors θ
radians in a counterclockwise direction.

Table 6.9 Two principal components with two additional variables. Compared to Table 6.6, the
loadings of the old variables have changed when two new variables are added.

> pc2 <- principal(R,2)


> pc2

Uniquenesses:
V1 V2 V3 V4 V5 V6 V7 V8
0.194 0.271 0.367 0.311 0.379 0.468 0.367 0.575
Loadings:
PC1 PC2
V1 0.90
V2 0.85
V3 0.80
V4 0.83
V5 0.79
V6 0.73
V7 0.80
V8 0.65
PC1 PC2
SS loadings 2.812 2.268
Proportion Var 0.352 0.284
Cumulative Var 0.352 0.635

T = \begin{pmatrix} cos(\theta) & sin(\theta) \\ -sin(\theta) & cos(\theta) \end{pmatrix}    (6.13)

Generalizing equation 6.13 to larger matrices is straightforward: T is an identity matrix except for the cells at the intersections of rows and columns i and j, which contain cos(\theta) and \pm sin(\theta):

T = \begin{pmatrix}
1 & \dots & 0 & \dots & 0 & \dots & 0 \\
0 & \dots & cos(\theta) & \dots & sin(\theta) & \dots & 0 \\
\dots & \dots & 0 & 1 & 0 & \dots & 0 \\
0 & \dots & -sin(\theta) & \dots & cos(\theta) & \dots & 0 \\
\dots & \dots & 0 & \dots & 0 & \dots & \dots \\
0 & \dots & 0 & \dots & 0 & \dots & 1
\end{pmatrix}    (6.14)

When F is post-multiplied by T, T will rotate the ith and jth columns of F by θ radians in a
counterclockwise direction.

F_r = F T    (6.15)
The factor.rotate function from the psych package will do this rotation for arbitrary angles
(in degrees) for any pairs of factors. This is useful if there is a particular rotation that is
desired.
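A minimal sketch of Equations 6.13 and 6.15 for the two factor case (factor.rotate generalizes this to any pair of factors and works in degrees):

rotate2 <- function(F, theta) {                # theta in radians
  T <- matrix(c(cos(theta), -sin(theta),       # Equation 6.13, filled by column
                sin(theta),  cos(theta)), 2, 2)
  F %*% T                                      # Equation 6.15
}
# e.g., rotate2(f2$loadings, pi/6) rotates the two factors 30 degrees counterclockwise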
As pointed out by Carroll (1953) when discussing Thurstone’s (1947) simple structure
as a rotational criterion “it is obvious that there could hardly be any single mathematical
expression which could embody all these characteristics.” (p 24). Carroll’s solution to this
was to minimize the sum of the inner products of the squared (rotated) loading matrix. An
alternative, discussed by Ferguson (1954) is to consider the parsimony of a group of n tests
with r factors to be defined as the average parsimony of the individual tests (I_j) where
I_j = \sum_{m=1}^{r} a_{jm}^4    (6.16)

(the squared communality) and thus the average parsimony is


I_. = n^{-1} \sum_{j=1}^{n} \sum_{m=1}^{r} a_{jm}^4

and to choose a rotation that maximizes parsimony.
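Both quantities are trivial to compute for any loading matrix (an illustrative two-liner, not a function from any package):

parsimony <- function(F) rowSums(unclass(F)^4)   # I_j for each test (Equation 6.16)
mean(parsimony(f2$loadings))                     # the average parsimony over the n tests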


Parsimony as defined in equation 6.16 is a function of the variance as well as the mean
of the squared loadings of a particular test on all the factors. For fixed communality h2 , it
will be maximized if all but one loading is zero; a variable’s parsimony will be maximal if
one loading is 1.0 and the rest are zero. In path notation, parsimony is maximized if one
and only one arrow is associated with a variable. This criterion, as well as the criterion of
maximum variance taken over factors has been operationalized as the quartimax 4 criterion
by Neuhaus and Wrigley (1954). As pointed out by Kaiser (1958), the criterion can rotate
towards a solution with one general factor, ignoring other, smaller factors.
If a general factor is not desired, an alternative measure of the parsimony of a factor,
similar to equation 6.16 is to maximize the variance of the squared loadings taken over items
instead of over factors. This, the varimax criterion was developed by Kaiser (1958) to avoid
the tendency to yield a general factor. Both of these standard rotations as well as many
others are available in the GPArotation package of rotations and transformations which uses
the Gradient Projection Algorithms developed by Jennrich (2001, 2002, 2004).

6.3.1 Orthogonal rotations

Consider the correlation matrix of eight physical measurements reported by Harman (1976)
and included as the Harman23.cor data set in base R. Factoring this set using factanal and
not rotating the solution produces the factor loadings shown in the left panel of Figure 6.3
and in Table 6.10. Rotating using the varimax criterion (Kaiser, 1958) produces the solution
shown in the right hand panel of Figure 6.3 and in Table 6.11. An alternative
way to represent these solutions is in terms of path diagrams (Figure 6.4). Here, all loadings
less than .30 are suppressed.

6.3.2 Oblique transformations

Many of those who use factor analysis use it to identify theoretically meaningful constructs
which they have no reason to believe are orthogonal. This has led to the use of oblique transformations which allow the factors to be correlated. (The term oblique is used for historical reasons regarding the use of reference factors, which became oblique as the angles between the factors themselves became acute.) Although the term rotation is sometimes used
4 The original programs for doing this and other rotations and transformations were written in FOR-
TRAN and tended to use all caps. Thus, one will see QUARTIMAX, VARIMAX, PROMAX as names
of various criteria. More recent usage has tended to drop the capitalization.
[Figure 6.3 appears here: two panels, "Unrotated" and "Varimax rotated", plotting the Factor2 loadings against the Factor1 loadings of the 8 physical variables (height, arm.span, forearm, lower.leg, weight, bitro.diameter, chest.girth, chest.width).]

Fig. 6.3 Comparing unrotated and varimax rotated solutions for 8 physical variables from Harman
(1976). The first factor in the unrotated solution has the most variance but the structure is not
“simple”, in that all variables have high loadings on both the first and second factor. The first factor
seems to represent overall size, the second factor is a bipolar factor separating height from width. The
varimax rotation provides a simpler solution with factors easily interpreted as “lankiness” (factor 1) and
“stockiness” (factor 2). The rotation angle was 33.45 degrees clockwise. The original axes from the left
panel are shown as dashed lines in the right panel. Compare this figure with the path representation
in Figure 6.4.

for both orthogonal and oblique solutions, in the oblique case the factor matrix is not rotated
so much as transformed.
Oblique transformations lead to the distinction between the factor pattern and factor
structure matrices. The factor pattern matrix is the set of regression weights (loadings) from
the latent factors to the observed variables (see equation 6.4). The factor structure matrix
is the matrix of correlations between the factors and the observed variables. If the factors
are uncorrelated, structure and pattern are identical. But, if the factors are correlated, the
structure matrix (S) is the pattern matrix (F) times the factor intercorrelations (φ):

S = F\phi \quad \Longleftrightarrow \quad F = S\phi^{-1}    (6.17)

and the modeled correlation matrix is the pattern matrix times the structure matrix plus the
uniquenesses:
R = F S' = F\phi F' + U^2 = S\phi^{-1}S' + U^2    (6.18)
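Equation 6.17 is easy to verify numerically. Using, for instance, the oblimin solution f2t created in Table 6.12 below (fa stores the factor intercorrelations in the Phi element of its output), the structure matrix is just the pattern matrix post-multiplied by φ:

pattern <- unclass(f2t$loadings)      # the factor pattern (regression weights)
phi     <- f2t$Phi                    # the factor intercorrelations
round(pattern %*% phi, 2)             # the factor structure: correlations of variables with factors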
Simple structure here really means “simple pattern”, for the goal is to find an inter-factor correlation matrix φ and a transformation matrix T that produce a simple pattern matrix matching the Thurstonian goals. An early solution to this problem was the quartimin algorithm developed by Carroll (1953), later refined into the biquartimin criterion (Carroll, 1957).
Alternatives to quartimin have included promax (Hendrickson and White, 1964), which transforms a varimax solution to a somewhat simpler but oblique target. The Promax function
in psych is based upon the promax rotation option in factanal but returns the correlations
between the factors. Promax finds the regression weights that will best transform the original
[Figure 6.4 appears here: path diagrams of the unrotated (left) and varimax rotated (right) two factor solutions for the 8 physical variables; paths with loadings less than .3 are not drawn.]

Fig. 6.4 An alternative way to show simple structure is in terms of a path diagram. Paths with
coefficients less than .3 are not drawn. Although the two models fit equally well, the original solution
(left panel) has more paths (is more complex) than the varimax rotated solution to the right.
Compare this path diagram to the spatial representation in Figure 6.3.

orthogonal loadings to the varimax rotated loadings raised to an arbitrary power (default
of 4). More recent work in transformations has included the simplimax criterion (Kiers,
1994) which has been shown to do a good job of maximizing factor simplicity (Lorenzo-Seva, 2003). Available as part of the GPArotation package are rotations that maximize factorial
simplicity using the criteria of Bentler (1977). Also available in the GPArotation package
is the geominQ transformation which has proved useful in exploratory structural equation
modeling (Asparouhov and Muthén, 2009) for it allows for solutions with some items having
complexity two. Yet more alternative transformations towards specified targets are possible
using the target.rot function. The default transformation for target.rot is towards an
independent cluster solution in which each variable loads on one and only one factor.
Consider the factor solution to the Harman 8 physical variable problem shown earlier.
By allowing the two dimensions to correlate, a simpler pattern matrix is achieved using an
oblimin transformation, although at the cost of introducing a correlation between the factors.

Table 6.10 Harman’s 8 physical variables yield a clear unipolar size factor with a secondary, bipolar
factor representing the distinction between height and width. The default rotation is varimax in factanal and oblimin in fa, so it is necessary to specify “none” to obtain the unrotated solution.

> data(Harman23.cor)
> f2 <- factanal(factors=2,covmat=Harman23.cor,rotation="none")
> f2

Call:
factanal(factors = 2, covmat = Harman23.cor, rotation = "none")
Uniquenesses:
height arm.span forearm lower.leg weight bitro.diameter chest.girth
0.170 0.107 0.166 0.199 0.089 0.364 0.416
chest.width
0.537
Loadings:
Factor1 Factor2
height 0.880 -0.237
arm.span 0.874 -0.360
forearm 0.846 -0.344
lower.leg 0.855 -0.263
weight 0.705 0.644
bitro.diameter 0.589 0.538
chest.girth 0.526 0.554
chest.width 0.574 0.365

Factor1 Factor2
SS loadings 4.434 1.518
Proportion Var 0.554 0.190
Cumulative Var 0.554 0.744

Test of the hypothesis that 2 factors are sufficient.


The chi square statistic is 75.74 on 13 degrees of freedom.
The p-value is 6.94e-11

6.3.3 Non-Simple Structure Solutions: The Simplex and Circumplex

In contrast to Thurstone's (1947) search for simple structure, Guttman (1954)
discussed two structures that do not have simple structure, but that are seen in ability and
interest tests. As discussed by Browne (1992), the simplex is a structure where it is possible
to order tests such that the largest correlations are between adjacent tests, and the smallest
correlations are between tests at the opposite end of the order. Patterns such as this are seen
in measures taken over time, with the highest correlations between tests taken at adjacent
times. Simplex patterns also occur when dichotomous items vary in difficulty. The resulting
correlation matrix will tend to have the highest correlations between items with equal or near
equal difficulty.
An alternative structure discussed by Guttman (1954) and Browne (1992) is the circumplex, where the pattern of correlations is such that the tests can be ordered so that the
correlations between pairs first decreases and then increases. This is a common solution in
the study of emotions and interpersonal relations. Introduced to the personality researcher by

Table 6.11 Varimax rotation applied to the 2 factor solution for 8 physical variables. Done using the
varimax function in GPArotation.

> fv <- varimax(f2$loadings) #loadings from prior analysis


> fv

Loadings:
Factor1 Factor2
height 0.865 0.287
arm.span 0.927 0.181
forearm 0.895 0.179
lower.leg 0.859 0.252
weight 0.233 0.925
bitro.diameter 0.194 0.774
chest.girth 0.134 0.752
chest.width 0.278 0.621

Factor1 Factor2
SS loadings 3.335 2.617
Proportion Var 0.417 0.327
Cumulative Var 0.417 0.744

$rotmat
[,1] [,2]
[1,] 0.8343961 0.5511653
[2,] -0.5511653 0.8343961

[Figure 6.5 appears here: two panels plotting the 8 physical variables, "Unrotated" (MR2 against MR1) and "Oblique Transformation" (F2 against F1).]

Fig. 6.5 An oblimin solution to the Harman physical variables problem. The left panel shows the
original solution, the right panel shows the oblique solution. The dashed line shows the location of the
transformed Factor 2.

Table 6.12 Factor or component solutions can be transformed using oblique transformations such
as oblimin, geominQ, Promax or target.rot. The solution provides a simpler pattern matrix. The
print.psych function defaults to show all loadings, but can be made to show a limited number of
loadings by adjusting the “cut” parameter.

> f2t <- fa(Harman23.cor$cov,2,rotate="oblimin",n.obs=305)


> print(f2t)

Factor Analysis using method = minres


Call: fa(r = Harman23.cor$cov, nfactors = 2, rotate = "oblimin", n.obs = 305)
item MR1 MR2 h2 u2
height 1 0.87 0.08 0.84 0.16
arm.span 2 0.96 -0.05 0.89 0.11
forearm 3 0.93 -0.04 0.83 0.17
lower.leg 4 0.88 0.04 0.81 0.19
weight 5 0.01 0.94 0.89 0.11
bitro.diameter 6 0.00 0.80 0.64 0.36
chest.girth 7 -0.06 0.79 0.59 0.41
chest.width 8 0.13 0.62 0.47 0.53

MR1 MR2
SS loadings 3.37 2.58
Proportion Var 0.42 0.32
Cumulative Var 0.42 0.74

With factor correlations of


MR1 MR2
MR1 1.00 0.46
MR2 0.46 1.00

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the model is 13 and the objective function was 0.26
The number of observations was 305 with Chi Square = 76.38 with prob < 5.3e-11

Fit based upon off diagonal values = 1


Measures of factor score adequacy
[,1] [,2]
Correlation of scores with factors 0.98 0.96
Multiple R square of scores with factors 0.96 0.93
Minimum correlation of factor score estimates 0.91 0.85

Wiggins (1979), circumplex structures have been studied by those interested in the interpersonal dimensions of agency and communion (Acton and Revelle, 2002; Conte and Plutchik, 1981; Gurtman, 1993, 1997; Gurtman and Pincus, 2003). A circumplex may be represented
as coordinates in a two dimensional space, although some prefer to represent the items in
polar coordinates (e.g., Gurtman, 1997). Mood and emotion data are typically thought of as
representing two dimensions (energetic and tense arousal (Rafaeli and Revelle, 2006; Thayer,
1978, 1989) or alternatively Positive and Negative Affect (Larsen and Diener, 1992; Watson
and Tellegen, 1985)). Tests for circumplex structure have been proposed by Acton and Rev-
elle (2004) and Browne (1992). A subset of the Acton and Revelle (2004) tests have been
implemented in the circ.tests function.

The appeal of circumplex structures to those who use them seems to be that there is no
preferred set of axes and all rotations are equally interpretable. For those who look for under-
lying causal bases for psychometric structures, the circumplex solutions lack the simplicity
found in simple structure. To some extent, the distinction between simple and circumplex
structures is merely one of item selection. Measures of affect, for example, can be chosen
that show simple structure, or with a more complete sampling of items, that show circum-
plex structure. The simulation functions item.sim and circ.sim have been developed to help
understand the effects of item characteristics on structure.

[Figure 6.6 appears here: "Circumplex structure", showing 24 simulated items plotted on PA1 (x axis) and PA2 (y axis), arranged in a circular pattern.]

Fig. 6.6 A simulation of 24 items showing a circumplex structure. Simple structure would result if
the items shown in diamonds were not included. Items generated using circ.sim and plotted using
cluster.plot.

A mixture of a conventional five dimensional model of personality with a circumplex solution recognizes that most items are of complexity two and thus need to be represented in the 10 pairs of the five dimensions (Hofstee et al., 1992a). This solution (the “Abridged Big Five
Circumplex” or “AB5C”) recognizes that the primary loadings for items can be represented
in a five dimensional space, and that items are loaded on at most two dimensions. One way
of reporting factor structures that have a circumplex structure is to convert the loadings

into polar coordinates and organize the results by angle and radius length (see polar).
Polar coordinate representations of emotion words allow for a clearer understanding of what
words are near each other in the two dimensional space than does the more conventional
Cartesian representation (Rafaeli and Revelle, 2006).

6.3.4 Hierarchical and higher order models

If factors or components are allowed to correlate, there is implicitly a higher order factor that
accounts for those correlations. This is the structure proposed for abilities (Carroll, 1993)
where lower level specific abilities may be seen as correlating and representing the effect of
a third stratum, general intelligence. An example of such a solution is discussed by Jensen
and Weng (1994) in terms of how to extract a g factor of intelligence. The Schmid-Leiman
transformation (Schmid and Leiman, 1957) that they apply is a constrained model of general
higher order solutions (Yung et al., 1999). This is implemented in the omega and schmid
functions in psych. An example applied to nine correlated variables is shown in Tables 6.13
and 6.14. Two alternative representations are seen in Figure 6.7. The first is a hierarchical
solution with a higher order factor accounting for the correlation between the lower order
factors. The second is a solution in which the g factor identified in the hierarchical solution is first partialled out of the items, and the correlations of these residualized items are then accounted for by three factors. These three lower level factors are orthogonal and represent pure group factors without the influence of the general factor.
It is important to understand how this analysis is done. The original orthogonal factor
matrix (shown in Table 6.14) can be transformed obliquely using either the oblimin or target.rot transformation functions, and then the correlations of those factors are themselves factored (see the top panel of Figure 6.7). g loadings of the items are then found from the
product of the item loadings on the factors with the factor loadings on g. The resulting load-
ings on the orthogonal group factors are then found by subtracting the amount of variance
for each item associated with g from the item communality. The square root of this is the
loading on the orthogonalized group factors (Schmid and Leiman, 1957) (see the bottom half
of Figure 6.7). This sort of hierarchical or bifactor solution is used to find the ωh coefficient
of general factor saturation (see 7.2.5).
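The arithmetic can be followed by hand by retyping the oblique loadings, factor intercorrelations, and g loadings reported in Table 6.14 (a sketch only; the omega function does all of this internally):

oblique <- matrix(c(.8, .7, .6, rep(0, 6),
                    rep(0, 3), .7, .6, .5, rep(0, 3),
                    rep(0, 6), .6, .5, .4), ncol = 3)             # $schmid$oblique
phi   <- matrix(c(1, .72, .63, .72, 1, .56, .63, .56, 1), 3, 3)   # $schmid$phi
gload <- c(.9, .8, .7)                                            # $schmid$gloading
g     <- as.vector(oblique %*% gload)          # g loading of each item
h2    <- diag(oblique %*% phi %*% t(oblique))  # item communalities
group <- sqrt(h2 - g^2)                        # loadings on the orthogonalized F1*, F2*, F3*
round(cbind(g, group, h2), 2)                  # compare with the Schmid-Leiman loadings in Table 6.13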
Given the normal meaning of the word hierarchy to represent a tree like structure, there
is an unfortunate naming confusion in these various models. While Gustafsson and Balke
(1993), McDonald (1999) and Yung et al. (1999) refer to the structure in the top panel
of Figure 6.7 as a higher order model and to solutions similar to those of the lower panel
(Schmid and Leiman, 1957) as hierarchical models or nested models, Mulaik and Quartetti
(1997) refer to the top panel as hierarchical and the lower panel as nested. Continuing the
confusion, Chen et al. (2006) refer to the upper panel as a second order factoring, and the
lower panel as a bifactor solution.

6.3.5 Comparing factor solutions

To evaluate the similarity between alternative factor or component solutions, it is possible to


find the factor congruence coefficient by using the factor.congruence function. For a single

Table 6.13 An example hierarchical correlation matrix taken from Jensen and Weng (1994) with
a bifactor solution demonstrating a Schmid-Leiman transformation and the extraction of a g factor.
Loadings < .001 have been suppressed.

> ability <- sim.hierarchical() #create a hierarchical data structure


> round(ability,2)
> omega.ability <- omega(ability) #solve it to show structure
> print(omega.ability,cut=.001) #print the results, suppressing all values < .001

V1 V2 V3 V4 V5 V6 V7 V8 V9
V1 1.00 0.56 0.48 0.40 0.35 0.29 0.30 0.25 0.20
V2 0.56 1.00 0.42 0.35 0.30 0.25 0.26 0.22 0.18
V3 0.48 0.42 1.00 0.30 0.26 0.22 0.23 0.19 0.15
V4 0.40 0.35 0.30 1.00 0.42 0.35 0.24 0.20 0.16
V5 0.35 0.30 0.26 0.42 1.00 0.30 0.20 0.17 0.13
V6 0.29 0.25 0.22 0.35 0.30 1.00 0.17 0.14 0.11
V7 0.30 0.26 0.23 0.24 0.20 0.17 1.00 0.30 0.24
V8 0.25 0.22 0.19 0.20 0.17 0.14 0.30 1.00 0.20
V9 0.20 0.18 0.15 0.16 0.13 0.11 0.24 0.20 1.00

Omega
Call: omega(m = ability)
Alpha: 0.76
G.6: 0.76
Omega Hierarchical: 0.69
Omega H asymptotic: 0.86
Omega Total 0.8

Schmid Leiman Factor loadings greater than 0.001


g F1* F2* F3* h2 u2
V1 0.72 0.35 0.64 0.36
V2 0.63 0.31 0.49 0.51
V3 0.54 0.26 0.36 0.64
V4 0.56 0.42 0.49 0.51
V5 0.48 0.36 0.36 0.64
V6 0.40 0.30 0.25 0.75
V7 0.42 0.43 0.36 0.64
V8 0.35 0.36 0.25 0.75
V9 0.28 0.29 0.16 0.84

With eigenvalues of:


g F1* F2* F3*
2.29 0.29 0.40 0.40

general/max 5.74 max/min = 1.39


The degrees of freedom for the model is 12 and the fit was 0

Table 6.14 An orthogonal factor solution to the ability correlation matrix of Table 6.13 may be
rotated obliquely using oblimin and then those correlations may be factored. g loadings of the items
are then found from the product of the item loadings on the factors times the factor loadings on g.
This procedure results in the solution seen in Table 6.13.

> print(omega.ability,all=TRUE) #show all the output from omega

$schmid$orthog

Loadings:
F1 F2 F3
V1 0.70 0.30 0.25
V2 0.61 0.26 0.22
V3 0.52 0.22 0.19
V4 0.24 0.63 0.18
V5 0.21 0.54 0.16
V6 0.17 0.45 0.13
V7 0.17 0.15 0.56
V8 0.14 0.12 0.46
V9 0.11 0.10 0.37

F1 F2 F3
SS loadings 1.324 1.144 0.884
Proportion Var 0.147 0.127 0.098
Cumulative Var 0.147 0.274 0.372

$schmid$oblique
F1 F2 F3
V1 0.8 0.0 0.0
V2 0.7 0.0 0.0
V3 0.6 0.0 0.0
V4 0.0 0.7 0.0
V5 0.0 0.6 0.0
V6 0.0 0.5 0.0
V7 0.0 0.0 0.6
V8 0.0 0.0 0.5
V9 0.0 0.0 0.4

$schmid$phi
F1 F2 F3
F1 1.00 0.72 0.63
F2 0.72 1.00 0.56
F3 0.63 0.56 1.00

$schmid$gloading

Loadings:
Factor1
F1 0.9
F2 0.8
F3 0.7

Factor1
SS loadings 1.940
Proportion Var 0.647
[Figure 6.7 appears here: two path diagrams, "Hierarchical solution" (top), in which V1-V9 load on F1-F3, which in turn load on g, and "Schmid Leiman" (bottom), in which V1-V9 load on g and on the orthogonalized group factors F1*, F2*, and F3*.]

Fig. 6.7 The ωh (omega-hierarchical) coefficient of reliability may be calculated from a hierarchical
factor structure by using the Schmid Leiman transformation. The original factors are transformed to
an oblique solution (top panel) and then the general factor is extracted from the items. The remaining
factors (F1*, F2*, and F3*) are now orthogonal and reflect that part of the item that does not have g
in it. Figures drawn using the omega function. The correlation matrix is shown in Table 6.13.

pair of factors this is simply


r_c = \frac{\sum_{i=1}^{n} F_{xi} F_{yi}}{\sqrt{\sum_{i=1}^{n} F_{xi}^2 \sum_{i=1}^{n} F_{yi}^2}} .    (6.19)
The congruence coefficient is the cosine of the angle of the two factor loading vectors, taken
from the origin. Compare this to Equation 4.22. The correlation is the cosine of the angle of
the two loadings vectors, taken from the mean loading on each vector. Generalizing Equa-
tion 6.19 to the matrix of congruence coefficients, and expressing it in matrix algebra, the
congruence coefficient Rc for two matrices of factor loadings Fx and Fy is

R_c = diag(F_x'F_x)^{-1/2}\, F_x'F_y \, diag(F_y'F_y)^{-1/2}    (6.20)

where diag(F_x'F_x)^{-1/2}

is just dividing by the square root of the sum of squared loadings. Although similar in form
to a correlation coefficient (Equation 4.24), the difference is that the mean loading is not
subtracted when doing the calculations. Thus, even though two patterns are highly correlated,
if they differ in the mean loading, they will not have a high factor congruence. Consider the
factor congruence and correlations between the factors found in Tables 6.8 and 6.9.
> round(factor.congruence(f2,pc2),3)
PC1 PC2
Factor1 0.998 0.000
Factor2 0.000 0.997

> round(cor(f2$loadings,pc2$loadings),3)
PC1 PC2
Factor1 0.997 -0.981
Factor2 -0.969 0.993
Although the cross factor/component congruences are 0.0, the corresponding correlations are strongly negative (-.97 and -.98). The negative correlations reflect that the patterns of loadings are high versus 0 on Factor or Component 1, and 0 versus high on Factor or Component 2. Because the correlation subtracts the mean loading, the cross comparisons (1 with 2 and 2 with 1) become highly negatively correlated even though they are not congruent.
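The congruence coefficients themselves are easy to compute directly from Equation 6.19 (a minimal sketch, not the psych implementation):

congruence <- function(Fx, Fy) {
  Fx <- as.matrix(unclass(Fx))
  Fy <- as.matrix(unclass(Fy))
  num <- t(Fx) %*% Fy                               # cross products, summed over variables
  den <- sqrt(outer(colSums(Fx^2), colSums(Fy^2)))  # square roots of the summed squared loadings
  num/den
}
round(congruence(f2$loadings, pc2$loadings), 3)     # should match the factor.congruence output above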

6.4 The number of factors/components problem

A fundamental question in both components and factor analysis is how many components or
factors to extract? While it is clear that more factors will fit better than fewer factors, and that
more components will always summarize the data better than fewer, such an improvement in fit has a cost in parsimony. Henry Kaiser once said that “a solution to the number-of-factors problem in factor analysis is easy”: he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others “will regard quite highly, if not as the best” (Horn and Engstrom, 1979). There are at least eight
procedures that have been suggested:
1) Extracting factors until the χ 2 of the residual matrix is not significant. Although sta-
tistically this makes sense, it is very sensitive to the number of subjects in the data set. That
is, the more subjects analyzed, the more factors or components that will be extracted. The
rule is also sensitive to departures from normality in the data as well as the assumption that
residual error is random rather than systematic but small. χ 2 estimates are reported for the
maximum likelihood solution done by the factanal function and all solutions using fa.
2) Extracting factors until the change in χ 2 from factor n to factor n+1 is not significant.
The same arguments apply to this rule as the previous rule. That is, increases in sample size
will lead to an increase in the number of factors extracted.
3) Extracting factors until the eigenvalues of the real data are less than the corresponding
eigenvalues of a random data set of the same size (parallel analysis)(Dinno, 2009; Hayton
et al., 2004; Horn, 1965; Humphreys and Montanelli, 1975; Montanelli and Humphreys, 1976).
The fa.parallel function plots the eigenvalues for a principal components solution as well
as the eigenvalues when the communalities are estimated by a one factor minres solution

for a given data set as well as that of n (default value = 20) randomly generated parallel
data sets of the same number of variables and subjects. In addition, if the original data are
available, 20 artificial data sets are formed by resampling (with replacement) from these data
(Figure 6.8). A typical application is shown for 24 mental ability tests discussed by Harman
(1960, 1976) and reported originally by Holzinger and Swineford (1939) and available as the
Harman74.cor data set. Parallel analysis is partially sensitive to sample size in that for large
samples the eigenvalues of random factors will tend to be very small and thus the number of
components or factors identified will tend to be greater than with other rules. Although other parallel analysis functions use SMCs as estimates of the communalities (e.g., the paran function in the paran package),
simulations using the sim.parallel function suggest that the fa.parallel solution is a
more accurate estimate of the number of major factors in the presence of many small, minor
factors.
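Figure 6.8 was produced with a call along the following lines (treat the exact arguments as an assumption and see ?fa.parallel; Harman74.cor supplies its own sample size):

data(Harman74.cor)                                         # the 24 mental ability correlations
fa.parallel(Harman74.cor$cov, n.obs = Harman74.cor$n.obs)  # compare real and random eigenvalues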

[Figure 6.8 appears here: "Parallel Analysis Scree Plots", with the eigenvalues of principal components and factor analysis (y axis) plotted against the factor number (x axis) for the actual and simulated data (legend: PC Actual Data, PC Simulated Data, FA Actual Data, FA Simulated Data).]

Fig. 6.8 A parallel analysis of 24 mental ability tests is found by the fa.parallel function. The
eigenvalues of the principal components solution of 20 random data sets suggests four components.
A similar solution is found from the parallel factors. When the raw data are available, eigenvalues
for the correlation matrices formed from random samples (with replacement) of the raw data are also
examined.

4) Plotting the magnitude of the successive eigenvalues and applying the scree test (a
sudden drop in eigenvalues analogous to the change in slope seen when scrambling up the
talus slope of a mountain and approaching the rock face) (Cattell, 1966c). In the example
of the 24 mental tests case of Harman-Holzinger-Swineford (Figure 6.8), a strong argument
could be made for either one factor or four factors.
5) Extracting factors as long as they are interpretable. A surprisingly compelling rule for
the number of factors or components to extract. This basically reflects common sense. The
disadvantage is that investigators differ in their ability or desire to interpret factors. While
some will find a two factor solution most interpretable, others will prefer five.
6) Using the Very Simple Structure Criterion (VSS), (Revelle and Rocklin, 1979) (Fig-
ure 6.9). Most people, when interpreting a factor solution, will highlight the large (salient)
loadings and ignore the small loadings. That is, they are interpreting the factor matrix as if
it had simple structure. How well does this simplified matrix reproduce the original matrix? That is, if c is the complexity (number of non-zero loadings) of an item and max_c denotes the greatest (absolute) c loadings for an item, then find the Very Simple Structure matrix, S_c, where

s_{c_{ij}} = \begin{cases} f_{ij} & if\ f_{ij} \in max_c(f_{i.}) \\ 0 & otherwise \end{cases}
Then let

R^*_{sc} = R - S_c S_c'    (6.21)

and

VSS_c = 1 - \frac{\mathbf{1} R^{*2}_{sc} \mathbf{1}' - diag(R^*_{sc})}{\mathbf{1} R^2 \mathbf{1}' - diag(R)}
That is, the VSS criterion for a complexity c is 1 minus the sum of the squared residual correlations divided by the sum of the squared observed correlations, where the residuals are found from the “simplified” factor equation 6.21. Examples of two such VSS solutions, one for the Holzinger-Harman problem
of 24 mental tests and one for a set of 24 simulated variables representing a circumplex
structure in two dimensions are seen in Figure 6.9. The presence of a large general factor in
the Holzinger data set leads to an identification of one factor as being optimal if the tests are
seen as having complexity one, but four factors if the tests are of complexity two or three.
Compare this result to those suggested by the scree test or the parallel factors tests seen in
Figure 6.8. Unlike R∗ as found in equation 6.5, R∗sc as found in equation 6.21 is sensitive to
rotation and can also be used for evaluating alternative rotations. The Very Simple Structure
criterion is implemented in the VSS and VSS.plot functions in the psych package.
7) Using the Minimum Average Partial criterion (MAP). Velicer (1976) proposed that the
appropriate number of components to extract is that which minimizes the average (squared)
partial correlation. Considering the (n + p) × (n + p) super matrix composed of the n × n correlation matrix R, the n × p component matrix C, and the p × p covariance matrix of the components CC',

\begin{pmatrix} R & C' \\ C & CC' \end{pmatrix} = \begin{pmatrix} R & C' \\ C & I \end{pmatrix}
then, partialling the components out of the correlation matrix produces a matrix of partial covariances, R^* = R - CC', and the partial correlation matrix R^{\#} is found by dividing the partial covariances by their respective partial variances:

R^{\#} = D^{-1/2} R^* D^{-1/2}    (6.22)


[Figure 6.9 appears here: "24 Mental Tests" (left panel) and "24 Circumplex Items" (right panel), plotting the Very Simple Structure fit (y axis, 0 to 1) against the number of factors (x axis, 1 to 8) for complexities 1 through 4.]

Fig. 6.9 Very Simple Structure for 24 mental ability tests (left panel) and 24 circumplex structured
data (right panel). For the 24 mental tests, the complexity 1 solution suggests a large general factor
but for complexity two or three, the best solution seems to be four factors. For the 24 circumplex
items, the complexity one solution indicates that two factors are optimal, and the difference between
the complexity one and two factor solutions suggests that the items are not simple structured (which,
indeed, they are not.)

where D = diag(R^*). The MAP criterion is the average of the squared off-diagonal elements of R^{\#}. The logic of the MAP criterion is that although the residual correlations in R^* will become smaller as more factors are extracted, so will the residual variances (diag(R^*)). The magnitude of the partial correlations will thus diminish and then increase again when too many components are extracted. MAP is implemented in psych as part of the VSS function. For the Harman data set, MAP is at a minimum at four components; for the two dimensional circumplex structure, at two:
> round(harman.vss$map,4)
[1] 0.0245 0.0216 0.0175 0.0174 0.0208 0.0237 0.0278 0.0301
> round(circ.vss$map,3)
[1] 0.044 0.006 0.008 0.011 0.014 0.017 0.020 0.024

This result agrees with the VSS solution for these two problems.
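A stripped-down version of the MAP computation, using principal components for the partialling, might look like the following (a sketch only; it assumes the residual variances stay positive, and the VSS function does this more carefully):

map <- function(R, max.comp = 8) {
  e <- eigen(R)
  sapply(1:max.comp, function(k) {
    C  <- e$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(e$values[1:k]), k)  # component loadings
    Rs <- R - tcrossprod(C)                    # partial covariances R*
    D  <- diag(1/sqrt(diag(Rs)))
    Rp <- D %*% Rs %*% D                       # partial correlations (Equation 6.22)
    mean(Rp[lower.tri(Rp)]^2)                  # average squared partial correlation
  })
}
round(map(Harman74.cor$cov), 4)                # compare with the harman.vss$map values above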
8) Extracting principal components until the eigenvalue <1 (Kaiser, 1970). This is proba-
bly the most used and most disparaged rule for the number of components. The logic behind
it is that a component should be at least as large as a single variable. Unfortunately, in
practice the λ > 1 rule seems to be a very robust estimate of the number of variables divided
by three (Revelle and Rocklin, 1979; Velicer and Jackson, 1990b).
Each of the procedures has its advantages and disadvantages. Using either the χ 2 test
or the change in χ 2 test is, of course, sensitive to the number of subjects and leads to the
nonsensical condition that if one wants to find many factors, one simply runs more subjects. The scree test is quite appealing but can lead to differences of interpretation as to when the
scree “breaks”. Extracting interpretable factors means that the number of factors reflects the
investigator's creativity more than the data. VSS, while very simple to understand, will not
work very well if the data are very factorially complex. (Simulations suggest it will work fine
if the complexities of some of the items are no more than 2). The eigenvalue of 1 rule, although
the default for many programs, is fairly insensitive to the correct number and suggests that
the number of factors is roughly 1/3 of the number of variables (Revelle and Rocklin, 1979;
Velicer and Jackson, 1990b). It is the least recommended of the procedures (for recent reviews
discussing how too many articles still use the eigenvalue of 1 rule, see Dinno (2009)). MAP
and VSS have the advantage that simulations show that they achieve a minimum (MAP) or
maximum (VSS) at the correct number of components or factors (Revelle and Rocklin, 1979;
Zwick and Velicer, 1982).
The number of factors problem was aptly summarized by Clyde Coombs who would say
that determining the number of factors was like saying how many clouds were in the sky. On a
clear day, it was easy, but when it was overcast, the problem became much more complicated.
That the number of factors problem is important and that the standard default option in
commercial packages such as SPSS is inappropriate has been frequently bemoaned (Dinno,
2009; Preacher and MacCallum, 2003) and remains a challenge to the interpretation of many
factor analyses. Perhaps the one clear recommendation is to not use the eigenvalue of 1
rule. Looking for consistency between the results of parallel analysis, the scree, the MAP
and VSS tests and then trying to understand why the results differ is probably the most
recommended way to choose the number of factors.

6.5 The number of subjects problem

A recurring question in factor analysis and principal components analysis is how many sub-
jects are needed to get a stable estimate? Although many rules of thumb have been proposed,
suggesting minima of 250 or 500 and ratios as high as 10 times as many subjects as variables,
the answer actually depends upon the clarity of the structure being examined (MacCallum
et al., 1999, 2001). Most important is the communality of the variables. As the communality
of the variables goes up, the number of observations necessary to give a clean structure goes
down. A related concept is one of overdetermination. As the number of markers for a factor
increase, the sample size needed decreases. That is, if the ratio of the number of variables to
number of factors is high (e.g., 20:3), and the communalities are high (> .70), then sample
sizes as small as 60-100 are quite adequate. But, if that ratio is low, then even with large
sample sizes (N = 400), admissible solutions are not guaranteed. Although more subjects are
always better, it is more important to have good markers for each factor (high communalities)

as well as many markers (a high variable to factor ratio) than it is to increase the number of
subjects.
Unfortunately, although it is wonderful advice to have many markers of high communality
(as is the case when examining the structure of tests), if using factor analysis to analyze the
structure of items rather than tests, the communalities will tend to be fairly low (.2 < h2 < .4
is not uncommon). In this case, increasing the number of markers per factor, and increasing
the number of subjects is advised. It is useful to examine the likelihood of achieving an ac-
ceptable solution as a function of low communalities, small samples, or the number of markers
by simulation. Using the sim.circ, sim.item or VSS.sim functions, it is straightforward to
generate samples of various sizes, communalities, and complexities.

6.6 The problem of factoring items

It is not uncommon to use factor analysis or related methods (see below) to analyze the
structure of ability or personality items. Unfortunately, item difficulty or endorsement fre-
quencies can seriously affect the conclusions. As discussed earlier (4.5.1.4), the φ correlation,
which is a Pearson correlation for dichotomous items, is greatly affected by differences in
item endorsement frequencies. This means that the correlations between items reflects both
shared variance as well as differences in difficulty. Factoring the resulting correlation matrix
can lead to a spurious factor reflecting differences in difficulty. Consider the correlations of
9 items, all with loadings of .7 on a single factor, but differing in their endorsement frequencies from 4%
to 96%. These items were generated using the sim.npn function discussed in Chapter 8. A
parallel analysis of these correlations suggests that two factors are present. These two corre-
lated factors represent the two ends of the difficulty spectrum with the first factor reflecting
the first six items, the second the last three. The same data matrix yields more consistent
tetrachoric correlations and parallel analysis suggests one factor is present. A graphic figure
using fa.diagram of these two results is in Figure 6.10.

6.7 Confirmatory Factor Analysis

Exploratory factor analysis is typically conducted in the early stages of scale development.
The questions addressed are how many constructs are being measured; how well are they
measured, and what are the correlations between the constructs? Exploratory factor analysis
is just that, exploratory. It is a very powerful tool for understanding the measures that one is
using and should be done routinely when using a new measure. Statistical and psychometric
goodness of fit tests are available that analyze how well a particular exploratory model
accounts for the data.
But many investigators want to go beyond exploring their data and prefer to test how
well a particular model, derived a priori, fits the data. This problem will be addressed more
fully when we consider structural equation modeling, SEM, in chapter 10, which combines a
measurement model with a structural model (or a validity model). Confirmatory factor anal-
ysis is done when evaluating the measurement model. The process is logically very simple:
specify the exact loadings in the pattern matrix and the exact correlations in the factor inter-
correlation matrix. Then test how well this model reproduces the data. This is rarely done,

Table 6.15 Nine items are simulated using the sim.npn function with endorsement frequencies ranging
from 96% to 4%. Part A: The correlations of 9 items differing in item endorsement but all with equal
saturation of a single factor. The number of factors suggested by fa.parallel is two. Part B: The
correlations based upon the tetrachoric correlation are both higher and more similar. The number of
factors suggested by fa.parallel is one.

Part A:
> set.seed(17)
> items <- sim.npn(9,1000,low=-2.5,high=2.5)$items
> describe(items)

var n mean sd median trimmed mad min max range skew kurtosis se
1 1 1000 0.96 0.21 1 1.00 0 0 1 1 -4.44 17.73 0.01
2 2 1000 0.90 0.30 1 1.00 0 0 1 1 -2.72 5.40 0.01
3 3 1000 0.80 0.40 1 0.88 0 0 1 1 -1.51 0.27 0.01
4 4 1000 0.69 0.46 1 0.73 0 0 1 1 -0.80 -1.36 0.01
5 5 1000 0.51 0.50 1 0.51 0 0 1 1 -0.03 -2.00 0.02
6 6 1000 0.33 0.47 0 0.29 0 0 1 1 0.72 -1.49 0.01
7 7 1000 0.17 0.37 0 0.08 0 0 1 1 1.77 1.15 0.01
8 8 1000 0.10 0.29 0 0.00 0 0 1 1 2.74 5.51 0.01
9 9 1000 0.04 0.20 0 0.00 0 0 1 1 4.50 18.26 0.01

> round(cor(items),2)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]


[1,] 1.00 0.29 0.22 0.22 0.20 0.14 0.10 0.07 0.05
[2,] 0.29 1.00 0.24 0.24 0.24 0.19 0.14 0.10 0.07
[3,] 0.22 0.24 1.00 0.30 0.30 0.27 0.20 0.14 0.11
[4,] 0.22 0.24 0.30 1.00 0.35 0.28 0.23 0.17 0.13
[5,] 0.20 0.24 0.30 0.35 1.00 0.32 0.28 0.25 0.15
[6,] 0.14 0.19 0.27 0.28 0.32 1.00 0.28 0.26 0.21
[7,] 0.10 0.14 0.20 0.23 0.28 0.28 1.00 0.31 0.22
[8,] 0.07 0.10 0.14 0.17 0.25 0.26 0.31 1.00 0.32
[9,] 0.05 0.07 0.11 0.13 0.15 0.21 0.22 0.32 1.00

Part B:
> tet <- tetrachoric(items)

Call: tetrachoric(x = items)


tetrachoric correlation
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1.00 0.62 0.52 0.56 0.64 0.58 0.50 0.39 0.24
[2,] 0.62 1.00 0.47 0.48 0.55 0.52 0.57 0.45 0.39
[3,] 0.52 0.47 1.00 0.51 0.53 0.56 0.57 0.47 0.53
[4,] 0.56 0.48 0.51 1.00 0.54 0.49 0.52 0.47 0.56
[5,] 0.64 0.55 0.53 0.54 1.00 0.50 0.53 0.59 0.45
[6,] 0.58 0.52 0.56 0.49 0.50 1.00 0.48 0.52 0.54
[7,] 0.50 0.57 0.57 0.52 0.53 0.48 1.00 0.57 0.51
[8,] 0.39 0.45 0.47 0.47 0.59 0.52 0.57 1.00 0.65
[9,] 0.24 0.39 0.53 0.56 0.45 0.54 0.51 0.65 1.00

Fig. 6.10 Factor analyses of items can reflect difficulty factors unless the tetrachoric correlation is used.
fa.diagram(fa(items,2),main="Phi correlations") and fa.diagram(fa(tet$rho),main="Tetrachoric correlations")

but rather what is estimated is which loadings in the pattern matrix are non-zero and which correlations in the factor intercorrelation matrix are non-zero. The typical model estimation is done by a maximum likelihood algorithm (e.g., Jöreskog, 1978) in commercial packages such as EQS (Bentler, 1995), LISREL (Jöreskog and Sörbom, 1999), or Mplus (Muthén and Muthén, 2007), and in open source programs such as Mx (Neale, 1994). SEM is implemented in R in the sem package written by John Fox (2006) and in OpenMx.
Consider a simple model in which four tests (V1 ... V4) are all measures of the same construct, θ. The question to ask is then how well each test measures θ. For this example, the
four tests are created using the sim.congeneric function using population loadings on θ
of .8, .7, .6, and .5. The input to sem specifies that the four variables have loadings to be
estimated of a, b, c, and d. In addition, each variable has an error variance to be estimated
(parameters u, v, w, and x). The variance of θ is arbitrarily set to be 1. These instructions
are entered into the model matrix which is then given as input to sem (Table 6.16). That
these results are very similar to what is observed with exploratory factor analysis is apparent
when the standardized loadings from the sem output are compared to the results of factanal
(Table 6.17).
The previous example tested the hypothesis that all four variables had significant loadings on the same factor. The results were identical to what was found using exploratory analysis. A much more compelling use of a confirmatory model is to test a more restricted hypothesis, e.g., that the loadings are the same. This is tested by revising the model to constrain the coefficients a, b, c, and d to be equal (Table 6.18). The χ² value shows that this model is not a good fit to the data. More importantly, the difference in χ² between the two models (.46 with df = 2 and 56.1 with df = 5) is also a χ² (55.5) with df = 3, the difference in degrees of freedom between the two models. Thus the tau equivalent model may be rejected in favor of the (correct) congeneric model.
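The χ² difference test described above can be computed directly; a small illustration using the values reported in Tables 6.16 and 6.18:

chisq.diff <- 56.1 - 0.46      # tau equivalent minus congeneric model chi-square
df.diff <- 5 - 2               # difference in degrees of freedom
pchisq(chisq.diff, df.diff, lower.tail = FALSE)   # very small p value: reject tau equivalence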

Table 6.16 Confirmatory factoring of four congeneric measures using the sem package. The data
are simulated using the sim.congeneric function which returns the model as well as the observed
correlation matrix and the simulated data if desired. The congeneric model allows the four factor
loadings to differ. Compare this solution to the more constrained model of tau equivalence (Table 6.18).

> set.seed(42) #set the random number generator to allow for a comparison with other solutions
> congeneric <- sim.congeneric(N=1000,loads = c(.8,.7,.6,.5),short=FALSE) #generate an artificial data set
> S <- cov(congeneric$observed)
> model.congeneric <- matrix(c("theta -> V1", "a", NA,
+ "theta -> V2", "b", NA,
+ "theta -> V3","c", NA,
+ "theta -> V4", "d", NA,
+ "V1 <-> V1", "u", NA,
+ "V2 <-> V2", "v", NA,
+ "V3 <-> V3", "w", NA,
+ "V4 <-> V4", "x", NA,
+ "theta <-> theta", NA, 1), ncol = 3, byrow = TRUE)
> model.congeneric #show the model

[,1] [,2] [,3]


[1,] "theta -> V1" "a" NA
[2,] "theta -> V2" "b" NA
[3,] "theta -> V3" "c" NA
[4,] "theta -> V4" "d" NA
[5,] "V1 <-> V1" "u" NA
[6,] "V2 <-> V2" "v" NA
[7,] "V3 <-> V3" "w" NA
[8,] "V4 <-> V4" "x" NA
[9,] "theta <-> theta" NA "1"

> sem.congeneric = sem(model.congeneric,S, 1000) #do the analysis


> summary(sem.congeneric, digits = 3)

Model Chisquare = 0.46 Df = 2 Pr(>Chisq) = 0.795


Chisquare (null model) = 910 Df = 6
Goodness-of-fit index = 1
Adjusted goodness-of-fit index = 0.999
RMSEA index = 0 90% CI: (NA, 0.0398)
Bentler-Bonnett NFI = 1
Tucker-Lewis NNFI = 1.01
Bentler CFI = 1
SRMR = 0.00415
BIC = -13.4
Normalized Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.177000 -0.032200 -0.000271 0.010600 0.017000 0.319000
Parameter Estimates
Estimate Std Error z value Pr(>|z|)
a 0.829 0.0320 25.90 0 V1 <--- theta
b 0.657 0.0325 20.23 0 V2 <--- theta
c 0.632 0.0325 19.43 0 V3 <--- theta
d 0.503 0.0340 14.80 0 V4 <--- theta
u 0.316 0.0346 9.12 0 V1 <--> V1
v 0.580 0.0334 17.35 0 V2 <--> V2
w 0.604 0.0337 17.94 0 V3 <--> V3
x 0.776 0.0382 20.31 0 V4 <--> V4
Iterations = 13

Table 6.17 The sem confirmatory factoring and factanal exploratory factoring yield identical results
for four congeneric measures. This is as would be expected given that there is no difference in the
parameters being estimated. Compare this to Table 6.18 where some parameters are constrained to be
equal.

> std.coef(sem.congeneric) #find the standardized path coefficients

Std. Estimate
a a 0.82766 V1 <--- theta
b b 0.65302 V2 <--- theta
c c 0.63068 V3 <--- theta
d d 0.49584 V4 <--- theta

> factanal(congeneric,factors=1)

Call:
factanal(factors = 1, covmat = Cov.congeneric, n.obs = 1000)
Uniquenesses:
V1 V2 V3 V4
0.315 0.574 0.602 0.754
Loadings:
Factor1
V1 0.828
V2 0.653
V3 0.631
V4 0.496
Factor1
SS loadings 1.755
Proportion Var 0.439
Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 0.46 on 2 degrees of freedom.
The p-value is 0.795

There are many goodness of fit statistics for SEM, partly in response to the problem that
the χ 2 statistic is so powerful (Marsh et al., 1988, 2005). The meaning of these statistics as
well as many more examples of confirmatory factor analysis will be considered in Chapter 10.

6.8 Alternative procedures for reducing the complexity of the data

In addition to factor analysis and principal components analysis, there are two more alternatives for reducing the complexity of the data. One approach, multidimensional scaling, was introduced earlier when considering how to represent groups of objects. MDS considers the correlation matrix as a set of (inverse) distances and attempts to fit these distances using a limited number of dimensions. The other alternative, cluster analysis, represents a large class of algorithms, some of which are particularly useful for the practical problem of forming homogeneous subsets of items from a larger group of items. Although typically used for grouping subjects (objects), clustering techniques can also be applied to the grouping of items. As such, the resulting clusters are very similar to the results of components or factor analysis.

Table 6.18 Tau equivalence is a special case of congeneric tests in which factor loadings with the
true score are all equal, but the variables have unequal error variances. This model can be tested by
constraining the factor loadings to equality and then examining the fit statistics. Compare this result
to the case where the parameters are free to vary in the complete congeneric case (Table 6.16).

> model.allequal <- matrix(c("theta -> V1", "a", NA,"theta -> V2", "a",NA,"theta -> V3","a",NA,
+ "theta -> V4", "a", NA, "V1 <-> V1", "u", NA, "V2 <-> V2", "v", NA, "V3 <-> V3", "w", NA,
+ "V4 <-> V4", "x", NA, "theta <-> theta", NA, 1), ncol = 3, byrow = TRUE)
> sem.allequal = sem(model.allequal,S, 1000) #do the analysis
> summary(sem.allequal, digits = 3)

Model Chisquare = 56.1 Df = 5 Pr(>Chisq) = 7.64e-11


Chisquare (null model) = 910 Df = 6
Goodness-of-fit index = 0.974
Adjusted goodness-of-fit index = 0.947
RMSEA index = 0.101 90% CI: (0.0783, 0.126)
Bentler-Bonnett NFI = 0.938
Tucker-Lewis NNFI = 0.932
Bentler CFI = 0.943
SRMR = 0.088
BIC = 21.6

Normalized Residuals
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.160 -2.890 -0.967 -0.418 2.290 3.000

Parameter Estimates
Estimate Std Error z value Pr(>|z|)
a 0.668 0.0202 33.2 0 V1 <--- theta
u 0.448 0.0270 16.6 0 V1 <--> V1
v 0.565 0.0315 18.0 0 V2 <--> V2
w 0.576 0.0319 18.1 0 V3 <--> V3
x 0.730 0.0386 18.9 0 V4 <--> V4
Iterations = 10

6.8.1 MDS solutions remove the general factor

An alternative to factor analysis or components analysis is to analyze the correlation matrix using multidimensional scaling (MDS). Metric and non-metric MDS techniques attempt to maximize the fit (minimize the residual, an index of which is known as stress) between a data matrix of distances and the distances calculated from a matrix of coordinates. This was discussed in 2.7 and 4.3.
When working with variables sharing a large general factor such as general intelligence in
the cognitive domain, or neuroticism in the clinical domain, all the entries in the correlation
matrix will be large and thus, the average correlation will be large. Especially if orthogonal
rotations are used, it is difficult to see the structure of the smaller factors. A factor analytic
solution is to use oblique transformations, extract second order factors, and then plot the
loadings on the lower level factors. An alternative to this is to represent the correlations in
terms of deviations from the average correlation. One way to do this is to convert the corre-
lations to distances and do a multidimensional scaling of the distances. As was shown earlier (Equation 4.11), the distance between two variables is an inverse function of their correlation:

$$d_{xy} = \sqrt{2(1 - r_{xy})}. \tag{6.23}$$

Applying Equation 6.23 to the 24 ability measures of Holzinger and Swineford found in the Harman74.cor data set results in a very interpretable two dimensional solution, with an implicit general dimension removed (Figure 6.11). The data points have been shown using different
plotting symbols in order to show the cluster structure discussed below. Distances from the
center of the figure represent lower correlations with the general factor. It appears as if the
four factors of the factor analytic solution are represented in terms of the four quadrants of
the MDS solution.

Table 6.19 Multidimensional scaling of 24 mental tests. See Figure 6.11 for a graphical representation.
The correlations are transformed to distances using Equation 6.23. The multidimensional scaling uses
the cmdscale function.

> dis24 <- sqrt(2*(1-Harman74.cor$cov))


> mds24 <- cmdscale(dis24,2)
> plot.char <- c( 19, 19, 19, 19, 21, 21, 21, 21, 21, 20, 20, 20,
20, 23, 23, 23, 23, 23, 23, 19, 22, 19, 19, 22 )
> plot(mds24,xlim=c(-.6,.6),ylim=c(-.6,.6),xlab="Dimension 1",ylab="Dimension 2",asp=1,pch=plot.char)
> position <- c(2,2,3,4, 4,4,3,4, 3,2,3,2, 3,3,1,4, 4,1,3,1, 1,2,3,4)
> text(mds24,rownames(mds24),cex=.6,pos=position)
> abline(v=0,h=0)
> title("Multidimensional Scaling of 24 ability tests")
> #draw circles at .25 and .50 units away from the center
> segments = 51
> angles <- (0:segments) * 2 * pi/segments
> unit.circle <- cbind(cos(angles), sin(angles))
> lines(unit.circle*.25)
> lines(unit.circle*.5)

6.8.2 Cluster analysis – poor man’s factor analysis?

Another alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as that of factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical (Martinent and Ferrand, 2007; Mun et al., 2008) psychology. Interestingly enough, it has had limited application to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps


Fig. 6.11 A multidimensional scaling solution removes the general factor from 24 ability measure-
ments. Distance from the center of the plot reflects distance from the general factor. Symbols reflect
the hierarchical clusters groupings seen in Figure 6.13.

even the non-specialist finds compelling. Cooksey and Soutar (2006) review why the ICLUST algorithm is particularly appropriate for scale construction in marketing.
As reviewed by Aldenderfer and Blashfield (1984), Blashfield (1980), and Blashfield and Aldenderfer (1988), there have been three main psychological approaches to cluster analysis. Tryon (1939) and Loevinger et al. (1953), working in the psychometric tradition of factor analysis, developed clustering algorithms as alternatives to factor analysis. Ward (1963) introduced a goodness of fit measure for hierarchical clustering based upon an analysis of variance perspective. This was adapted by Johnson (1967), who was influenced by the multidimensional scaling considerations of ordinal properties of distances (Kruskal, 1964) and considered hierarchical clustering from a non-metric perspective.
At the most abstract level, clusters are just ways of partitioning the data or the proximity matrix. A clustering rule is merely a way of assigning n items to c clusters. For n = the number of items or tests and c = the number of clusters, the goal is to assign each item to one cluster. Thus, for an n item vector g, assign a cluster number, 1 ≤ cg ≤ c, to each gi. In terms of scoring the data, this may be conceptualized as finding an n * c matrix of keys, K, where each kij is 1, 0, or -1, that optimally divides the matrix. Two items are said to be in the same cluster if they both have non-zero entries for that column of the K matrix. The keys matrix, when multiplied by the N * n data matrix, X, will produce a scores matrix, Sxk, of dimension N * c. This scores matrix is merely a set of scores on each of the c scales and represents the sum of the items that define each scale. Such a way of scoring (just adding up the salient items) is, in fact, what is typically done for factor or component analysis. The problem then is how to find the grouping or clustering matrix, K. That is, what is the criterion of optimality? As pointed out by Loevinger et al. (1953), unlike factor analysis, which weights items by their factor loadings, cluster analysis solves the practical problem of assigning items to tests on an all-or-none basis.
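A minimal sketch of scoring with such a keys matrix (the item assignments below are illustrative, not from the text):

library(psych)
set.seed(42)
X <- sim.item(12, 500)                                           # 500 simulated responses to 12 items
K <- make.keys(12, list(C1 = c(1, 2, 3, -7), C2 = c(4, 5, 6)))   # -7 keys item 7 negatively
S <- X %*% K                                                     # N x c matrix of unit-weighted cluster scores
round(cor(S), 2)                                                 # correlation of the two cluster scores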
There are two limits for the K matrix. One is just the identity matrix of rank n. That
is, no items are in the same cluster. The other is a one column matrix of all -1s or 1s.
That is, all items are in one cluster. The problem is thus to determine what is the optimal
number of clusters, c, and how to allocate items to these clusters. Optimality may be defined
in multiple ways, and different clustering algorithms attempt to maximize these different
criteria. Loevinger et al. (1953) tried to maximize the within cluster homogeneity in terms
similar to those of reliability theory. Ward (1963) evaluated clusters by a within cluster versus
between cluster reduction of variance estimate.
There are two broad classes of clustering algorithms: hierarchical and non-hierarchical. Within the hierarchical class, there are also two broad approaches: divisive and agglomerative. Finally, there is the distinction between overlapping and non-overlapping clusters; that is, can an item be in more than one cluster? Some of these distinctions will make more sense if we consider two examples: one, developed by Loevinger et al. (1953), is non-hierarchical; the other, implemented in the ICLUST algorithm, is an agglomerative, hierarchical algorithm that allocates items to distinct clusters and that was specifically developed for the problem of forming homogeneous clusters of items (Revelle, 1979).

6.8.2.1 Non-hierarchical clustering

The process of forming clusters non-hierarchically was introduced for scale construction by
Loevinger et al. (1953). The steps in that procedure are very similar to those found in other
non-hierarchical algorithms.
1. Find the proximity (e.g. correlation) matrix,
2. Identify the most similar triplet of items,
3. Combine this most similar triplet of items to form a seed cluster,
4. Add items to this cluster that increase the cluster saturation,
5. Eliminate items that would reduce the cluster saturation,
6. Repeat steps 4 and 5 until the cluster saturation fails to increase. This is a homogeneous subtest.
7. Identify a new nucleus set of three items from the remaining items and start at step 3.
8. Repeat steps 4-7 until no items remain. The clusters are both maximally homogeneous
and maximally independent.
Step 1, finding a proximity matrix, requires a consideration of what is meant by being
similar. Typically, when constructing tests by combining subsets of items, an appropriate
index of similarity is the correlation coefficient. Alternatives include covariances, or perhaps
some inverse function of distance. A question arises as to whether to reverse score items that
are negatively correlated with the cluster. Although this makes good sense for items (“I do

not feel anxious” when reverse scored correlates highly with “other people think of me as
very anxious”), it is problematic when clustering objects (e.g., people). For if two people are
very dissimilar, does that necessarily mean that the negative of one person is similar to the
other?
Steps 2 and 3, identifying the most similar triplet, require both a theoretical examination of the content and identification of the triplet whose items have the highest covariances with each other. This set is then seen as the nucleus of the subtest to be developed. With suitably created items, it is likely that the nucleus will be easily defined. The variance of this cluster is just the sum of the item covariances and variances of the three items. Its covariance with other items is just the sum of the separate covariances.
Steps 4 and 5 add items to maximize the cluster saturation and reject items that would reduce it. Cluster saturation is merely the ratio of the sum of the item covariances or correlations to the total test variance:

$$S = \frac{\mathbf{1}'\mathbf{V}\mathbf{1} - diag(\mathbf{V})}{\mathbf{1}'\mathbf{V}\mathbf{1}}.$$

For an n item test, saturation is a function of α (see 7.2.3):

$$\alpha = S\,\frac{n}{n-1}.$$
To prevent “functional drift” as the clusters are formed, items once rejected from a cluster
are not considered for that cluster at later stages.
Steps 4 and 5 are continued until S fails to increase.
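These two quantities are easy to compute directly; a small illustrative helper (not part of any package) under the definitions above:

library(psych)
cluster.saturation <- function(V) {     # V: covariance matrix of the items in a candidate cluster
  total <- sum(V)                       # 1'V1, the variance of the composite
  S <- (total - sum(diag(V))) / total   # proportion of composite variance due to the covariances
  n <- ncol(V)
  list(S = S, alpha = S * n / (n - 1))
}
set.seed(42)
V3 <- cov(sim.congeneric(N = 500, loads = c(.7, .6, .5), short = FALSE)$observed)
cluster.saturation(V3)                  # saturation and alpha for three congeneric items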
Step 7 may be done sequentially, or, alternatively, multiple seed clusters could have been
identified and the process would proceed in parallel.
If the goal is to efficiently cluster a group of items around a specified set of k seeds, or
a randomly chosen set of k seeds, the kmeans function implements the Hartigan and Wong
(1979) algorithm. This does not provide the direct information necessary for evaluating the
psychometric properties of the resulting scales, but with a few steps, the cluster output can be
converted to the appropriate keys matrix and the items can be scored. That is, cluster2keys
takes the cluster output of a kmeans clustering and converts it to a keys matrix suitable
for score.items or cluster.cor. Applying the kmeans function to the Harman data set produces the cluster membership seen in Figure 6.11.
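A hedged sketch of those steps applied to the Harman data set (the choice of four clusters is illustrative):

library(psych)
r24 <- Harman74.cor$cov        # the 24 x 24 correlation matrix (datasets package)
set.seed(42)
k4 <- kmeans(r24, 4)           # Hartigan-Wong kmeans clustering of the correlation matrix
keys <- cluster2keys(k4)       # convert cluster membership to a keys matrix
cluster.cor(keys, r24)         # alpha, G6*, and cluster intercorrelations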
Although the kmeans algorithm works very well if the items are all positively correlated, the
solution when items need to be reversed is not very good. This may be seen when forming
clusters from items generated to represent both positive and negative loadings in a two
dimensional space (Table 6.20). The data were generated using the sim.item function which
defaults to producing items with two simple structured factors with high positive and negative
loadings. When just two clusters were extracted, the solution was very poor (clusters C2.1 and C2.2 are more correlated with each other than they are internally consistent). When four clusters are extracted, they show the appropriate structure, with two sets of negatively
correlated clusters (e.g., C4.1, C4.3 and C4.2, C4.4) independent of each other. Compare
this solution to that shown in Table 6.21 where items were allowed to be reversed and a
hierarchical clustering algorithm was used. This difference is most obvious when plotting the
two cluster solutions from kmeans and ICLUST for 24 items representing a circumplex (see
Figure 6.12).

Table 6.20 kmeans of simulated two dimensional data with positive and negative loadings produces
two clusters that are highly negatively correlated and not very reliable (C2.1, C2.2). When four clusters
are extracted (C4.1 ... C4.4), the resulting keys produce two pairs of negatively correlated clusters,
with the pairs orthogonal to each other.

> set.seed(123)
> s2 <- sim.item(12)
> r2 <- cor(s2)
> k2 <- kmeans(r2,2)

> print(k2,digits=2)
K-means clustering with 2 clusters of sizes 6, 6
Cluster means:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
1 -0.16 -0.22 -0.22 -0.18 -0.17 -0.21 0.27 0.31 0.28 0.29 0.29 0.29
2 0.27 0.29 0.33 0.30 0.29 0.32 -0.18 -0.19 -0.19 -0.17 -0.21 -0.21
Clustering vector:
[1] 2 2 2 2 2 2 1 1 1 1 1 1
Within cluster sum of squares by cluster:
[1] 6.0 5.9
Available components:
[1] "cluster" "centers" "withinss" "size"

> keys2 <- cluster2keys(k2)


> k4 <- kmeans(r2,4)
> keys4 <- cluster2keys(k4)
> keys24 <- cbind(keys2,keys4)
> colnames(keys24) <- c("C2.1","C2.2","C4.1","C4.2","C4.3", "C4.4")
> cluster.cor(keys24,r2)

Call: cluster.cor(keys = keys24, r.mat = r2)

(Standardized) Alpha:
C2.1 C2.2 C4.1 C4.2 C4.3 C4.4
0.50 0.54 0.64 0.62 0.64 0.66

(Standardized) G6*:
C2.1 C2.2 C4.1 C4.2 C4.3 C4.4
0.59 0.62 0.59 0.58 0.60 0.61

Average item correlation:


C2.1 C2.2 C4.1 C4.2 C4.3 C4.4
0.14 0.16 0.37 0.36 0.37 0.39

Number of items:
C2.1 C2.2 C4.1 C4.2 C4.3 C4.4
6 6 3 3 3 3

Scale intercorrelations corrected for attenuation


raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
C2.1 C2.2 C4.1 C4.2 C4.3 C4.4
C2.1 0.50 -1.25 1.25 1.25 -0.85 -0.78
C2.2 -0.65 0.54 -0.80 -0.78 1.22 1.21
C4.1 0.71 -0.47 0.64 0.00 -0.03 -1.00
C4.2 0.70 -0.45 0.00 0.62 -1.05 0.02
C4.3 -0.48 0.71 -0.02 -0.66 0.64 0.04
C4.4 -0.45 0.72 -0.65 0.01 0.03 0.66

Fig. 6.12 When clustering items, it is appropriate to reverse key some items. When forming two
clusters from data with a circumplex structure, the kmeans algorithm, which does not reverse key
items, forms two highly negatively correlated clusters (left hand panel) while the ICLUST algorithm,
which does reverse items, captures the two dimensional structure of the data (right hand panel).

6.8.2.2 Hierarchical clustering

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting
tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the
nesting structure. Although there are many hierarchical clustering algorithms (e.g., agnes,
hclust, and ICLUST), the one most applicable to the problems of scale construction is ICLUST
(Revelle, 1979).
1. Find the proximity (e.g. correlation) matrix,
2. Identify the most similar pair of items
3. Combine this most similar pair of items to form a new variable (cluster),
4. Find the similarity of this cluster to all other items and clusters,
5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in ICLUST, if there is a failure to increase the reliability coefficients α or β).
6. Purify the solution by reassigning items to the most similar cluster center.
Just as for non-hierarchical problems, Step 1 requires a consideration of what is desired
and will probably mean finding a correlation matrix.
Step 2, identifying the most similar pair, might include adjusting the proximity matrix to correct for the reliability of the measures. As discussed for Step 1 of the non-hierarchical case, the question of whether to reverse key an item when considering similarity is very important. When

clustering items, it is probably a good idea to allow items to be reversed keyed. But when
clustering objects, this is not as obvious.
Step 3, combining items i and j into a new cluster, means adjusting the K matrix by forming a new column, cij, made up of the elements of ci and cj, and then deleting those two columns.
Step 4, finding the similarity of the cluster with the remaining items, can be done in many different ways. Johnson (1967) took a non-metric approach and considered the distance between two clusters to be defined by either the nearest neighbor (also known as single linkage) or the furthest neighbor (maximum distance or complete linkage). Metric techniques include considering the centroid distance or the correlation. Each of these choices may lead to different solutions: single linkage will lead to chaining, while complete linkage will lead to very compact clusters.
Step 5, repeating steps 2 and 3 until some criterion is reached, is the essence of hierarchical
clustering. Rather than just adding single items into clusters, hierarchical clustering allows
for clusters to form higher level clusters. The important question is whether it is a good idea
to combine clusters. Psychometric considerations suggest that clusters should be combined as long as the internal consistency of the cluster increases. There are at least three measures of internal consistency worth considering: the average inter-item correlation, the percentage of cluster variance that is reliable (e.g., coefficient α; Cronbach, 1951), and the percentage of cluster variance associated with one common factor (e.g., coefficient β; Revelle, 1979; Zinbarg et al., 2005). While the average r will not increase as higher order clusters are
formed, both α and β typically will. However, α will increase even if the clusters are barely
related, while β is more sensitive to over clustering and will not increase as frequently. To
the extent that the cluster is meant to represent one construct, the β criterion seems more
justified.
Step 6, purifying the clusters, is done only in the case that more than one cluster is
identified. It is sometimes the case that hierarchically formed clusters will include items that
are more similar to other clusters. A simple one or two stage purification step of reassigning
items to the most similar cluster will frequently remove this problem.
It is interesting to compare the ICLUST cluster solution for the Harman-Holzinger-Swineford 24 mental measures (Figure 6.13) with the cmdscale multidimensional scaling solution in Figure 6.11. Cluster C20 corresponds to the lower right quadrant of the MDS, C17 to the lower
and C10 to the upper region of the lower left quadrant (which then combine into C18), and
C13 to the upper left quadrant. C15 groups the variables in the upper region of the upper
right quadrant.
Cluster analysis results bear a great deal of similarity to early factor analytic work (e.g.,
Holzinger (1944)) which used various shortcuts to form groups of variables that were of ap-
proximately rank one within groups (clusters), but had some correlations between groups.
The item by cluster loadings matrix is similar to a factor structure matrix and thus may be
converted to a cluster pattern matrix by multiplying by the inverse of the cluster correla-
tions. At least for the Harman 24 mental ability measures, it is interesting to compare the
cluster pattern matrix with the oblique rotation solution from a factor analysis. The factor
congruence of a four factor oblique pattern solution with the four cluster solution is > .99
for three of the four clusters and > .97 for the fourth cluster.

Table 6.21 Applying the ICLUST algorithm to the circumplex data set of Table 6.20 produces two
orthogonal clusters with adequate reliability.

> ic <- ICLUST(r2)


> ic
> cluster.plot(ic)

ICLUST (Item Cluster Analysis)


Call: ICLUST(r.mat = r2)

Purified Alpha:
C10 C9
0.79 0.78

G6* reliability:
C10 C9
0.76 0.76

Original Beta:
C10 C9
0.74 0.74

Cluster size:
C10 C9
6 6

Item by Cluster Structure matrix:


C10 C9
[1,] 0.10 -0.63
[2,] -0.05 -0.60
[3,] -0.12 -0.59
[4,] -0.59 -0.01
[5,] -0.58 0.02
[6,] -0.66 -0.03
[7,] -0.04 0.59
[8,] 0.06 0.57
[9,] -0.03 0.62
[10,] 0.56 -0.01
[11,] 0.60 0.02
[12,] 0.62 0.02

With eigenvalues of:


C10 C9
2.2 2.2

Purified scale intercorrelations


reliabilities on diagonal
correlations corrected for attenuation above diagonal:
C10 C9
C10 0.79 0.01
C9 0.01 0.78


Fig. 6.13 The ICLUST hierarchical cluster analysis algorithm applied to 24 mental tests. The cluster
structure shows a clear one cluster (factor) solution, with four clear subclusters ( C13, C15, C18, and C
20). Graph created by ICLUST(Harman74.cor$cov). Compare this solution to that show in Figure 6.11.

Table 6.22 For the Harman 24 mental measurements the oblique pattern matrix from a factor analysis
and the pattern matrix from ICLUST are practically identical as judged by their factor congruence.

> clust24 <- ICLUST(Harman74.cor$cov,4) #cluster the 24 mental abilities


> clust.pattern <- clust24$pattern
> f24 <- factanal(covmat=Harman74.cor,factors=4,rotation="oblimin")
> round(factor.congruence(clust.pattern,f24),2)

Factor1 Factor2 Factor3 Factor4


C18 0.30 0.97 0.07 0.26
C13 0.98 0.05 0.01 0.08
C20 0.19 0.25 0.21 0.99
C15 0.09 0.21 0.99 0.22

6.9 Estimating factor scores, finding component and cluster scores

Factors are latent variables representing the structural relationships between observed variables. Although we cannot directly measure the factors, we can estimate the correlations between the f factors and the v observed variables. This correlation matrix is known as the structure matrix, v F f . More typically interpreted, particularly with an oblique solution, is the pattern matrix, v P f , of beta weights for predicting the variables from the factors. The resulting factor loadings, communalities, and uniquenesses are all defined by the model. As long as the number of free parameters (loadings) that need to be estimated is exceeded by the number of correlations, the model is defined. This is not the case, however, for factor scores. There will always be more free parameters to estimate than there will be observed data points. This leads to the problem of factor indeterminacy.
Given the data matrix, n Xv with elements Xi j , we can find the deviation matrix, n xv and
the covariance matrix between the observed variables, v Cv . This matrix may be reproduced
by the product of the factor matrix and its transpose. See Equation 6.3 which is reproduced
here with subscripts to show the number of observed covariances ($v(v-1)/2$) and the number of parameters to find ($vf - f(f-1)/2$):

$${}_v\mathbf{C}_v \approx {}_v\mathbf{F}_f \; {}_f\mathbf{F}'_v + {}_v\mathbf{U}^2_v \tag{6.24}$$


where the v U2v values are found by subtracting the diagonal of the modeled covariances from
the diagonal of the observed covariances.
We need to find factor scores, n S f , and the uniquenesses, n Uv such that

$${}_n\mathbf{x}_v = {}_n\mathbf{S}_f \; {}_f\mathbf{F}'_v + {}_n\mathbf{U}_v = ({}_v\mathbf{F}_f \; {}_f\mathbf{S}'_n + {}_v\mathbf{U}_n)' \tag{6.25}$$

and

$${}_v\mathbf{x}_n \; {}_n\mathbf{x}_v / n = ({}_v\mathbf{F}_f \; {}_f\mathbf{S}'_n)({}_n\mathbf{S}_f \; {}_f\mathbf{F}'_v)/n = {}_v\mathbf{C}_v.$$

Dropping the subscripts to make the equation more legible we find

$$\mathbf{C} = \mathbf{x}'\mathbf{x}/n = \mathbf{F}\mathbf{S}'\mathbf{S}\mathbf{F}'$$

and because the factors are orthogonal, $\mathbf{S}'\mathbf{S} = \mathbf{I}$,

$$\mathbf{C} = \mathbf{x}'\mathbf{x}/n = \mathbf{F}\mathbf{F}'.$$

Unfortunately, the number of free parameters in Equation 6.25 is $nv + nf + fv$, which, of course, exceeds the number of observed data points, $nv$. That is, we have more unknowns than knowns, and there is an infinity of possible solutions. We can see the problem more directly when solving Equation 6.25 for $\hat{\mathbf{S}}$. Post-multiplying each side of the equation by ${}_v\mathbf{F}_f$ leads to

$$({}_n\mathbf{x}_v - {}_n\mathbf{U}_v)\,{}_v\mathbf{F}_f = {}_n\hat{\mathbf{S}}_f \; {}_f\mathbf{F}'_v \; {}_v\mathbf{F}_f = {}_n\hat{\mathbf{S}}_f \; {}_f\mathbf{C}_f$$

and thus

$$\hat{\mathbf{S}} = (\mathbf{x} - \mathbf{U})\mathbf{F}\mathbf{C}^{-1} = \mathbf{x}\mathbf{F}\mathbf{C}^{-1} - \mathbf{U}\mathbf{F}\mathbf{C}^{-1}. \tag{6.26}$$
The problem of finding factor score estimates, Ŝ, is that while there is an observable part, $\mathbf{x}\mathbf{F}\mathbf{C}^{-1}$, of the score S, there is also an unobservable part, $\mathbf{U}\mathbf{F}\mathbf{C}^{-1}$. Unless the communalities are one and therefore the uniquenesses are zero (that is, unless we are doing a components analysis), the factor scores are indeterminate, although a best estimate of the factor scores (in terms of a least squares regression) will be

$$\hat{\mathbf{S}} = \mathbf{x}\mathbf{F}\mathbf{C}^{-1} = \mathbf{x}\mathbf{W}$$

where

$$\mathbf{W} = \mathbf{F}\mathbf{C}^{-1} \tag{6.27}$$

is just a matrix of the β weights for estimating factor scores from the observed variables, and the $R^2$ between these estimates and the factors is

$$R^2 = diag(\mathbf{W}\mathbf{F}). \tag{6.28}$$

Unfortunately, even for uncorrelated factors, the regression based weights found by Equa-
tion 6.27 will not necessarily produce uncorrelated factor score estimates. Nor, if the factors
are correlated, will the factor scores have the same correlations. Gorsuch (1983) and Grice
(2001) review several alternative ways to estimate factor scores, some of which will preserve
orthogonality if the factors are orthogonal, others that will preserve the correlations between
factors.
The regression solution (Equation 6.27) will produce factor score estimates that are most
correlated with the factors, but will not preserve the factor intercorrelations (or, in the case
of orthogonal factors, their orthogonality).
Harman (1976) proposed to weight the factor loadings based upon idealized variables by
finding the inverse of the inner product of the factor loadings:

$${}_k\mathbf{W}_v = ({}_k\mathbf{F}'_v \; {}_v\mathbf{F}_k)^{-1} \; {}_k\mathbf{F}'_v. \tag{6.29}$$

A somewhat different least squares procedure was proposed by Bartlett (1937) to minimize
the contribution of the unique factors:

$$\mathbf{W} = \mathbf{U}^{-2}\mathbf{F}(\mathbf{F}'\mathbf{U}^{-2}\mathbf{F})^{-1}. \tag{6.30}$$

A variation of Equation 6.30 proposed by Anderson and Rubin (1956) requires that the
factors be orthogonal, and preserves this orthogonality:

$$\mathbf{W} = \mathbf{U}^{-2}\mathbf{F}(\mathbf{F}'\mathbf{U}^{-2}\mathbf{C}\mathbf{U}^{-2}\mathbf{F})^{-1/2}. \tag{6.31}$$

This solution was generalized to oblique factors by McDonald (1981) and then extended by ten Berge et al. (1999): for a structure matrix, F, with factor intercorrelations, Φ, let $\mathbf{L} = \mathbf{F}\Phi^{1/2}$ and $\mathbf{D} = \mathbf{R}^{1/2}\mathbf{L}(\mathbf{L}'\mathbf{C}^{-1}\mathbf{L})^{-1/2}$; then

$$\mathbf{W} = \mathbf{C}^{-1/2}\mathbf{D}\Phi^{1/2}. \tag{6.32}$$

Two of these scoring procedures are available in the factanal function (regression scores and Bartlett's weighted least squares scores). All five may be found using the fa or factor.scores functions in psych.
So why are there so many ways to estimate factor scores? It is because factor scores
are underidentified and all of these procedures are approximations. Factor scores represent
estimates of the common part of the variables and are not identical to the factors themselves.
If a factor score estimate is thought of as a chopstick stuck into the center of an ice cream cone, and possible factor scores are represented by straws anywhere along the edge of the cone, the problem of factor indeterminacy becomes clear, for depending on the shape of the cone, two straws can be negatively correlated with each other (Figure 6.14).


Fig. 6.14 Factor indeterminacy may be thought of as a chop stick (Ŝ, the factor score estimates)
inserted into an ice cream cone (the infinite set of factor scores that could generate the factor score
estimates). In both figures, the vertical arrow is the factor score estimate, Ŝ. The dashed arrows
represent two alternative true factor score vectors, S1 and S2 that both correlate cos(θi ) with the factor
score estimate but cos(2θ ) = 2cos(θ )2 − 1 with each other. The imagery is taken from Niels Waller,
adapted from Stanley Mulaik. The figure is drawn using the dia.cone function.

In a very clear discussion of the problem of factor score indeterminacy, Grice (2001) reviews these alternative ways of estimating factor scores and considers weighting schemes that will produce uncorrelated factor score estimates as well as the effect of using coarse coded (unit weighted) factor weights. From the trigonometric identity cos(2θ) = 2cos(θ)² − 1, it follows that the minimum correlation of any two factor score vectors with each other is 2R² − 1. That is, if the factor score estimate correlates .707 with the factor, then two sets of actual factor scores associated with these estimates could be uncorrelated! It is important to examine this level of indeterminacy, for an R² < .5 implies that two estimates actually can be negatively correlated. The R² and minimum correlation possible between two factor scores

are reported in the output of the fa, factor.scores and factor.stats functions and may
be seen in Tables 6.7 and 6.12.
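A small hedged sketch of this computation from the weights and loadings returned by fa (treating the weights and loadings components of the output, and this way of combining them, as assumptions for illustration):

library(psych)
set.seed(42)
obs <- sim.congeneric(N = 500, loads = c(.8, .7, .6, .5), short = FALSE)$observed
f1 <- fa(obs, 1, scores = TRUE)
R2 <- diag(t(f1$weights) %*% f1$loadings)  # Equation 6.28 for a single factor
R2            # squared multiple correlation of the score estimate with the factor
2 * R2 - 1    # minimum possible correlation between two sets of true factor scores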
It is appropriate to ask what affects the correlation of the factors with their factor score estimates.
Just as with any regression, the larger the beta weights and the greater the number of
variables, the better the fit. Examining Equation 6.26 it is clear that the score estimates
will better match the factor scores as the contribution of the unique part of each variable
diminishes. That is, higher communalities will decrease the indeterminacy. The limit of this
is just a components model, in which component scores are perfectly defined by the data.
A demonstration that factor score estimates are correlated with but not identical to
the true factor scores can be shown by simulation. Consider the four variables examined in
Table 6.16. The sim.congeneric function that generated the observed scores did so by first
generating the latent variable and then adding error. Thus, it is possible to correlate the
estimated scores with the “true” (but simulated) factor score (Table 6.23). The correlation,
although large (.88) is not unity. It is the square root of the reliability of the factor score
estimate.
Clusters and principal components, on the other hand, are defined. They are just weighted
linear sums of the variables and thus are found by addition. Both are defined by

S = xW

where the weights matrix, W, is formed of -1, 0, or 1s in the case of clusters, and of β weights found from the loadings and the inverse of the correlation matrix (Equation 6.27) in the case of principal components. In the case of the four correlated variables in Table 6.23,
these two ways of finding scores are very highly correlated (.9981) as they are with the factor
score estimates (.9811 and .9690 for components and clusters, respectively).
The differences between the various estimation procedures are shown clearly in Table 6.24
where 1000 subjects were simulated with a correlated three-factor structure. Although the three-factor solutions are identical (not shown), the factor score estimates based upon the regression, Bartlett, or ten Berge methods are very different. Also shown are scores based
upon a principal components solution and the simple expedient of unit weighting the items
(that is, just adding them up).
That factor scores are indeterminate has been taken by some (e.g., Schonemann, 1996) to
represent psychopathology on the part of psychometricians, for the problem of indeterminacy
has been known (and either ignored or suppressed) since Wilson (1928). To others, factor
indeterminacy is a problem at the data level but not the structural level, and at the data
level it is adequate to report the degree of indeterminacy. This degree of indeterminacy is
indeed striking (Schonemann and Wang, 1972; Velicer and Jackson, 1990b), and should be
reported routinely. It is reported for each factor in the fa and omega functions.

6.9.1 Extending a factor solution to new variables

It is sometimes the case that factors are derived from a set of variables (with factor loadings Fo) and we want to see what the loadings of an extended set of variables, Fe, would be. Given the original correlation matrix Ro and the correlations of these original variables with the extension variables, Roe, it is a straightforward calculation to find the loadings Fe of the extension variables on the original factors. This technique was developed by Dwyer (1937)

Table 6.23 Factor scores are estimated by equation 6.26 and are correlated with the “true” factors.
Although the true scores are not normally available, by doing simulations it is possible to observe the
correlation between true and estimated scores. The first variable is the latent factor score, e1 ... e4
are the error scores that went into each observed score, V1 ... V4 are the observed scores, MR1 is the
factor score estimate, PC1 is the principal component score, A1 is the cluster score. Note the very high
correlation between the factor estimates and principal component and cluster score, and the somewhat
lower, but still quite large correlations of all three with the true, latent variable, theta.

> set.seed(42)
> test <- sim.congeneric(N=500,short=FALSE,loads = c(.8,.7,.6,.5))
> f1=fa(test$observed,1,scores=TRUE)
> c.scores <- score.items(rep(1,4),test$observed)
> pc1 <- principal(test$observed,1,scores=TRUE)
> scores.df <- data.frame(test$latent,test$observed,f1$scores,pc1$scores,c.scores$scores)
> round(cor(scores.df),2)

theta e1 e2 e3 e4 V1 V2 V3 V4 MR1 PC1 A1


theta 1.00 -0.01 0.04 0.00 -0.04 0.78 0.72 0.59 0.47 0.88 0.88 0.87
e1 -0.01 1.00 0.00 -0.02 -0.01 0.62 -0.01 -0.02 -0.02 0.35 0.22 0.20
e2 0.04 0.00 1.00 -0.04 -0.08 0.03 0.73 -0.01 -0.05 0.25 0.26 0.24
e3 0.00 -0.02 -0.04 1.00 0.00 -0.01 -0.03 0.81 0.00 0.16 0.25 0.26
e4 -0.04 -0.01 -0.08 0.00 1.00 -0.04 -0.08 -0.02 0.87 0.06 0.19 0.24
V1 0.78 0.62 0.03 -0.01 -0.04 1.00 0.56 0.45 0.35 0.91 0.83 0.81
V2 0.72 -0.01 0.73 -0.03 -0.08 0.56 1.00 0.40 0.28 0.79 0.78 0.76
V3 0.59 -0.02 -0.01 0.81 -0.02 0.45 0.40 1.00 0.27 0.65 0.72 0.72
V4 0.47 -0.02 -0.05 0.00 0.87 0.35 0.28 0.27 1.00 0.50 0.60 0.65
MR1 0.88 0.35 0.25 0.16 0.06 0.91 0.79 0.65 0.50 1.00 0.98 0.97
PC1 0.88 0.22 0.26 0.25 0.19 0.83 0.78 0.72 0.60 0.98 1.00 1.00
A1 0.87 0.20 0.24 0.26 0.24 0.81 0.76 0.72 0.65 0.97 1.00 1.00

for the case of adding new variables to a factor analysis without doing all the work over again, and was extended by Mosier (1938) to the multiple factor case. But factor extension is also appropriate when one does not want to include the extension variables in the original factor
analysis, but does want to see what the loadings would be anyway (Horn, 1973; Gorsuch,
1997). Logically, this is correlating factor scores with these new variables, but in fact can be
done without finding the factor scores.
Consider the super matrix R made up of the correlations of the variables within the original set, Ro, the correlations between the variables in the two sets, Roe, and the correlations of the variables within the second set, Re (the left hand part of Equation 6.33). This can be factored in the normal way (the right hand part of the equation, with θ being the factor intercorrelations).

$$\mathbf{R} = \begin{pmatrix} \mathbf{R}_o & \mathbf{R}_{oe} \\ \mathbf{R}'_{oe} & \mathbf{R}_e \end{pmatrix} = \begin{pmatrix} \mathbf{F}_o \\ \mathbf{F}_e \end{pmatrix} \theta \begin{pmatrix} \mathbf{F}'_o & \mathbf{F}'_e \end{pmatrix} + \mathbf{U}^2 \tag{6.33}$$

Now $\mathbf{R}_o = \mathbf{F}_o \theta \mathbf{F}'_o$ and $\mathbf{R}_{oe} = \mathbf{F}_o \theta \mathbf{F}'_e$. Thus,

$$\mathbf{F}_e = \mathbf{R}'_{oe}\mathbf{F}_o(\mathbf{F}'_o\mathbf{F}_o)^{-1}\theta^{-1}. \tag{6.34}$$



Table 6.24 9 variables were generated with a correlated factor structure. Three methods of factor
score estimation produce very different results. The regression scoring method does not produce factor
score estimates with the same correlations as the factors and does not agree with the estimates from
either Bartlett or tenBerge. The tenBerge estimates have the identical correlations as the factors. The
last two methods are component scores (RC1 ... RC3) and unit weighted “cluster” scores.

> set.seed(42)
> v9 <- sim.hierarchical(n=1000,raw=TRUE)$observed
> f3r <- fa(v9,3,scores="regression") #the default
> f3b <- fa(v9,3,scores="Bartlett")
> f3t <- fa(v9,3,scores="tenBerge")
> p3 <- principal(v9,3,scores=TRUE,rotate="oblimin")
> keys <- make.keys(9,list(C1=1:3,C2=4:6,C3=5:7))
> cluster.scores <- score.items(keys,v9)
> scores <- data.frame(f3r$scores,f3b$scores,f3t$scores,p3$scores,cluster.scores$scores)
> round(cor(scores),2)

MR1 MR2 MR3 MR1.1 MR2.1 MR3.1 MR1.2 MR2.2 MR3.2 RC1 RC3 RC2 C1 C2 C3
MR1 1.00 -0.39 -0.23 0.87 0.00 0.00 0.81 0.14 0.14 0.94 -0.16 -0.09 0.85 0.01 0.00
MR2 -0.39 1.00 -0.14 0.00 0.89 0.00 0.09 0.80 0.08 -0.21 0.93 -0.08 0.02 0.86 0.58
MR3 -0.23 -0.14 1.00 0.00 0.00 0.94 0.07 0.06 0.85 -0.12 -0.07 0.92 0.04 0.04 0.45
MR1.1 0.87 0.00 0.00 1.00 0.44 0.32 0.99 0.58 0.48 0.93 0.25 0.17 0.99 0.45 0.43
MR2.1 0.00 0.89 0.00 0.44 1.00 0.26 0.53 0.98 0.39 0.21 0.95 0.12 0.46 0.98 0.77
MR3.1 0.00 0.00 0.94 0.32 0.26 1.00 0.39 0.35 0.98 0.15 0.14 0.93 0.35 0.29 0.64
MR1.2 0.81 0.09 0.07 0.99 0.53 0.39 1.00 0.66 0.56 0.89 0.34 0.24 0.99 0.54 0.53
MR2.2 0.14 0.80 0.06 0.58 0.98 0.35 0.66 1.00 0.50 0.35 0.90 0.20 0.60 0.97 0.80
MR3.2 0.14 0.08 0.85 0.48 0.39 0.98 0.56 0.50 1.00 0.31 0.25 0.88 0.52 0.42 0.72
RC1 0.94 -0.21 -0.12 0.93 0.21 0.15 0.89 0.35 0.31 1.00 0.00 0.00 0.93 0.22 0.20
RC3 -0.16 0.93 -0.07 0.25 0.95 0.14 0.34 0.90 0.25 0.00 1.00 0.00 0.27 0.95 0.76
RC2 -0.09 -0.08 0.92 0.17 0.12 0.93 0.24 0.20 0.88 0.00 0.00 1.00 0.19 0.14 0.43
C1 0.85 0.02 0.04 0.99 0.46 0.35 0.99 0.60 0.52 0.93 0.27 0.19 1.00 0.47 0.47
C2 0.01 0.86 0.04 0.45 0.98 0.29 0.54 0.97 0.42 0.22 0.95 0.14 0.47 1.00 0.84
C3 0.00 0.58 0.45 0.43 0.77 0.64 0.53 0.80 0.72 0.20 0.76 0.43 0.47 0.84 1.00

Consider a case of twelve variables reflecting two correlated factors with the first six
variables loading on the first factor, and the second six on the second factor. Let variables
3, 6, 9, and 12 form the extension variables. Factor the remaining variables and then extend
these factors to the omitted variables using fa.extension (Table 6.25). This may be shown graphically using the fa.diagram function (Figure 6.15).

6.9.2 Comparing factors and components – part 2

At the structural level, there are major theoretical differences between factors, components, and clusters. Factors are seen as the causes of the observed variables; clusters and components are just summaries of the observed variables. Factors represent the common part of the variables; clusters and components represent all of the variables. Factors do not change with the addition of new variables; clusters and components do. Factor loadings will tend to be smaller than component loadings and, in the limiting case of unrelated variables, can be zero even though component loadings are large. At the scores level, factor scores are estimated from the data, while cluster and component scores are found directly.

Table 6.25 Create 12 variables with a clear two factor structure. Remove variables 3, 6, 9, and 12 from the matrix and factor the other variables. Then extend this solution to the deleted variables. Compare
this solution to the solution with all variables included (not shown). A graphic representation is in
Figure 6.15.

set.seed(42)
fx <- matrix(c(.9,.8,.7,.85,.75,.65,rep(0,12),.9,.8,.7,.85,.75,.65),ncol=2)
Phi <- matrix(c(1,.6,.6,1),2)
sim.data <- sim.structure(fx,Phi,n=1000,raw=TRUE)
R <- cor(sim.data$observed)
Ro <- R[c(1,2,4,5,7,8,10,11),c(1,2,4,5,7,8,10,11)]
Roe <- R[c(1,2,4,5,7,8,10,11),c(3,6,9,12)]
fo <- fa(Ro,2)
fe <- fa.extension(Roe,fo)
fa.diagram(fo,fe=fe)

Call: fa.extension(Roe = Roe, fo = fo)


Standardized loadings based upon correlation matrix
MR1 MR2 h2 u2
V3 0.69 0.01 0.49 0.51
V6 0.66 -0.02 0.42 0.58
V9 0.01 0.66 0.44 0.56
V12 -0.06 0.70 0.44 0.56

MR1 MR2
SS loadings 0.89 0.89
Proportion Var 0.22 0.22
Cumulative Var 0.22 0.45

With factor correlations of


MR1 MR2
MR1 1.00 0.62
MR2 0.62 1.00

Unfortunately, many of the theoretical distinctions between factor analysis and compo-
nents analysis are lost when one of the major commercial packages claims to be doing factor
analysis by doing principal components.
Basic concepts of reliability and structural equation modeling are built upon the logic of
factor analysis and most discussions of theoretical constructs are discussions of factors rather
than components. There is a reason for this. The pragmatic similarity between the three sets
of models should not be used as an excuse for theoretical sloppiness.
In the next three chapters, latent variable models based upon factor analysis and related concepts will be applied to the problem of how well a trait is measured (issues in reliability), alternative models for determining latent variables (item response theory), and
the use of latent variable models to estimate the structural properties of the data (structural
equation modeling). All of the techniques rely on a basic understanding of latent variable
modeling and differ only in how the latent variables are estimated.


Fig. 6.15 Factor extension projects new variables into a factor solution from an original set of variables. In this simulated data set (see Table 6.25), 12 variables were generated to reflect two correlated factors. Variables 3, 6, 9, and 12 were removed and the remaining variables were factored using fa. The loadings of the removed variables on these original factors were then calculated using fa.extension. Compare these extended loadings to the loadings had all the variables been factored together.
Chapter 7
Classical Test Theory and the Measurement of
Reliability

Whether discussing ability, affect, or climate change, as scientists we are interested in the
relationships between our theoretical constructs. We recognize, however, that our measure-
ments are not perfect and that any particular observation has some unknown amount of error associated with it, for “all measurement is befuddled by error” (McNemar, 1946, p 294). When estimating central tendencies, the confidence interval of the mean may
be estimated by the standard error of the sample observations, and may be calculated from
the observed standard deviation and the number of observations. This is, of course, the basic
concept of Gossett and Fisher.
Error may be both random as well as systematic. Random error reflects trial by trial variabil-
ity due to unknown sources while systematic error may reflect situational or individual effects
that may be specified. Perhaps the classic example of systematic error (known as the personal
equation) is the analysis of individual differences in reaction time in making astronomical ob-
servations. “The personal equation of an observer is the interval of time which habitually
intervenes between the actual and the observed transit of a star...” (Rogers, 1869). Before
systematic individual differences were analyzed, the British astronomer Maskelyne fired his assistant Kinnebrook for making measurements that did not agree with his own (Stigler, 1986).
Subsequent investigations of the systematic bias (Safford, 1898; Sanford, 1889) showed con-
sistent individual differences as well as the effect of situational manipulations such as hunger
and sleep deprivation (Rogers, 1869).
Systematic error may be removed. What about the effect of random error? When estimat-
ing individual scores we want to find the standard error for each individual as well as the
central tendency for that individual. More importantly, if we want to find the relationship
between two variables, the errors of observation will affect the strength of the correlation be-
tween them. Charles Spearman (1904b) was the first psychologist to recognize that observed
correlations are attenuated from the true correlation if the observations contain error.
Now, suppose that we wish to ascertain the correspondence between a series of values, p, and
another series, q. By practical observation we evidently do not obtain the true objective values, p
and q, but only approximations which we will call p’ and q’. Obviously, p’ is less closely connected
with q’, than is p with q, for the first pair only correspond at all by the intermediation of the
second pair; the real correspondence between p and q, shortly r_pq, has been “attenuated” into r_p'q'
(Spearman, 1904b, p 90).

This attenuation of the relationship between p and q may be seen graphically in Figure 7.1 panel A. An alternative way of considering this is to examine the effect of combining true scores (dashed line) with error scores (light solid lines) to produce the observed score (heavy solid line), as shown in Figure 7.2.
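A small simulation (purely illustrative, with arbitrary values) of this attenuation:

set.seed(42)
n <- 10000
p <- rnorm(n)
q <- .6 * p + sqrt(1 - .6^2) * rnorm(n)  # true scores p and q correlate about .6
p.obs <- p + rnorm(n)                    # each observed score adds unit-variance error,
q.obs <- q + rnorm(n)                    # so each has reliability .5
cor(p, q)                                # close to .6
cor(p.obs, q.obs)                        # attenuated to about .6 * sqrt(.5 * .5) = .3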


Fig. 7.1 Spearman‘s model of attenuation and reliability. Panel A: The true relationship between p
and q is attenuated by the error in p’ and q’. Panel B: the correlation between the latent variable p
and the observed variable p’ may be estimated from the correlation of p’ with a parallel test.

7.1 Reliability and True Scores

The classic model of reliability treats an observed score, p’, as made up of two independent
components: the latent true score, p, and a latent error score, e (Figure 7.1 panel A). Errors
are “accidental deviations [that] are different in every individual case (hence are often called
the ‘variable errors’) and occur quite impartially in every direction according to the known
laws of probability” (Spearman, 1904b, p 76), and may be seen as randomly “augmenting and
diminishing” observed values, and “tending in a prolonged series to always more and more
perfectly counterbalance one another” (Spearman, 1904b, p 89).
Consider a simple case of asking students their ages but, to ensure privacy, asking them to
flip a coin before giving their answer. If the coin comes up heads, they should add 1 to their
real age, if it comes up tails, they should subtract 1. Clearly no observed score corresponds
to the true scores. But if we repeat this exercise 10 times, then the mean for each student
will be quite close to their true age. Indeed, as the number of observations per student
increases, the mean score will tend towards their true score with a precision based upon the
inverse of the square root of the number of observations. True score can then be defined as
the expected value, over multiple trials, of the observed score (Figure 7.2). Unfortunately, if
errors are systematically biased in one direction or another, this definition of true score will
not produce Platonic Truth. (The classic example is that if one is attempting to determine the sex of young chickens, errors will not have an expected value of zero, but rather will be slightly biased towards the other sex (Lord and Novick, 1968).)
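A brief simulation makes the coin flip example concrete. The following is a minimal sketch (the number of students, their ages, and the number of trials are arbitrary choices) showing that the mean of repeated error-perturbed reports converges on each student's true age:

#a small simulation of the coin flip example; the number of students,
#their ages, and the number of trials are arbitrary choices
set.seed(42)
n.students <- 8
n.trials <- 10
true.age <- sample(18:22, n.students, replace = TRUE)
#each report is the true age plus a random +1/-1 "coin flip" error
reports <- matrix(true.age, n.students, n.trials) +
  matrix(sample(c(-1, 1), n.students * n.trials, replace = TRUE),
         n.students, n.trials)
round(cbind(true = true.age, mean.report = rowMeans(reports)), 1)

With only 10 trials each mean still misses by a fraction of a year; increasing n.trials shrinks this error at the rate of one over the square root of the number of observations.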
Using more modern notation by replacing p' with x (for observed score) and p with t (for true score), each individual score, x, reflects a true value, t, and an error value, e; the expected score over multiple observations of x is t, and the expected score of e for any value of t is 0. Then, because the expected error score is the same for all true scores, the covariance of true score with error score ($\sigma_{te}$) is zero, and the variance of x, $\sigma_x^2$, is just
$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2 + 2\sigma_{te} = \sigma_t^2 + \sigma_e^2 .$$

Similarly, the covariance of observed score with true score is just the variance of true score
$$\sigma_{xt} = \sigma_t^2 + \sigma_{te} = \sigma_t^2$$
and the correlation of observed score with true score is
$$\rho_{xt} = \frac{\sigma_{xt}}{\sqrt{(\sigma_t^2 + \sigma_e^2)(\sigma_t^2)}} = \frac{\sigma_t^2}{\sigma_x \sigma_t} = \frac{\sigma_t}{\sigma_x} . \qquad (7.1)$$

By knowing the correlation between observed score and true score, $\rho_{xt}$, and from the definition of linear regression (Eqs. 4.2, 4.6), the predicted true score, $\hat{t}$, for an observed x may be found from
$$\hat{t} = b_{t.x}\, x = \frac{\sigma_t^2}{\sigma_x^2}\, x = \rho_{xt}^2\, x . \qquad (7.2)$$
All of this is well and good, but to find the correlation we need to know either $\sigma_t^2$ or $\sigma_e^2$. The question becomes how do we find $\sigma_t^2$ or $\sigma_e^2$?

[Figure 7.2: two panels plotting probability of score against score (-3 to 3), for Reliability = .80 (top) and Reliability = .50 (bottom).]

Fig. 7.2 Observed score (heavy solid line) is made up of true score (dotted line) and error scores at
each level of true score (light solid lines). The variance of true score more closely approximates that of
observed score when the error variances are small and the reliability is greater.

7.1.1 Parallel Tests, Reliability, and Corrections for Attenuation

To ascertain the amount of this attenuation, and thereby discover the true correlation, it appears
necessary to make two or more independent series of observations of both p and q. (Spearman,
1904b, p 90)

Spearman’s solution to the problem of estimating the true relationship between two variables,
p and q, given observed scores p’ and q’ was to introduce two or more additional variables
that came to be called parallel tests. These were tests that had the same true score for each
individual and also had equal error variances. To Spearman (1904b, p 90) this required finding "the average correlation between one and another of these independently obtained series of values" to estimate the reliability of each set of measures ($r_{p'p'}$, $r_{q'q'}$), and then to find
$$r_{pq} = \frac{r_{p'q'}}{\sqrt{r_{p'p'}\, r_{q'q'}}} . \qquad (7.3)$$

Rephrasing Spearman (1904b, 1910) in more current terminology (Lord and Novick, 1968; McDonald, 1999), reliability is the correlation between two parallel tests, where tests are said to be parallel if for every subject the true scores on each test are the expected scores across an infinite number of tests and thus the same, the true score variances for each test are the same ($\sigma_{p'_1}^2 = \sigma_{p'_2}^2 = \sigma_{p'}^2$), and the error variances across subjects for each test are the same ($\sigma_{e'_1}^2 = \sigma_{e'_2}^2 = \sigma_{e'}^2$) (see Figure 7.1). The correlation between two parallel tests will be
$$\rho_{p'_1 p'_2} = \rho_{p'p'} = \frac{\sigma_{p'_1 p'_2}}{\sqrt{\sigma_{p'_1}^2 \sigma_{p'_2}^2}} = \frac{\sigma_p^2 + \sigma_{pe_1} + \sigma_{pe_2} + \sigma_{e_1 e_2}}{\sqrt{\sigma_{p'_1}^2 \sigma_{p'_2}^2}} = \frac{\sigma_p^2}{\sigma_{p'}^2} . \qquad (7.4)$$

Expressing Equation 7.4 in terms of observed and true scores and comparing it to Equation 7.1, we see that the correlation between two parallel tests is the squared correlation of each test with true score and is the percentage of test variance that is true score variance
$$\rho_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = \rho_{xt}^2 . \qquad (7.5)$$

Reliability is the fraction of test variance that is true score variance. Knowing the reliability
of measures of p and q allows us to correct the observed correlation between p’ and q’ for
the reliability of measurement and to find the unattenuated correlation between p and q.
$$r_{pq} = \frac{\sigma_{pq}}{\sqrt{\sigma_p^2 \sigma_q^2}} \qquad (7.6)$$
and
$$r_{p'q'} = \frac{\sigma_{p'q'}}{\sqrt{\sigma_{p'}^2 \sigma_{q'}^2}} = \frac{\sigma_{(p+e_1)(q+e_2)}}{\sqrt{\sigma_{p'}^2 \sigma_{q'}^2}} = \frac{\sigma_{pq}}{\sqrt{\sigma_{p'}^2 \sigma_{q'}^2}} \qquad (7.7)$$
but from Eq 7.5,
$$\sigma_p^2 = \rho_{p'p'}\, \sigma_{p'}^2 \qquad (7.8)$$
and thus, by combining Equation 7.6 with 7.7 and 7.8, the unattenuated correlation between p and q corrected for reliability is Spearman's Equation 7.3:
$$r_{pq} = \frac{r_{p'q'}}{\sqrt{r_{p'p'}\, r_{q'q'}}} . \qquad (7.9)$$

As Spearman recognized, correcting for attenuation could show structures that otherwise,
because of unreliability, would be hard to detect. A very thoughtful discussion of the necessity
of correcting measures for attenuation has been offered by Schmidt and Hunter (1999) who
suggest that all measures should be so corrected. Borsboom and Mellenbergh (2002) disagree
and suggest that rather than apply corrections for attenuation from classical test theory, it
is more appropriate to think in a structural modeling context. But as will be discussed in
Chapter 10, this will lead to almost the same conclusion. An example of the power of correct-
ing for attenuation may be seen in Table 7.1. The correlations below the diagonal represent
observed correlations, the entries on the diagonal the reliabilities, and the entries above the
diagonal, the correlations corrected for reliability using equation 7.9. The data were gener-
ated using the sim.structural function to represent three different latent variables with

a particular structure and then the corrections for attenuation were made using the cor-
rect.cor function. Note how the structural relationships are much clearer when correcting
for attenuation. That is, variables loading on the same factor have dis-attenuated correlations
of unity, and the dis-attenuated correlations between variables loading on different factors
reflects the correlations between the factors.

Table 7.1 Correlations can be corrected for attenuation using Equation 7.9. The raw correlations in
the last matrix were created from the factor (fx) and structure matrices (Phi) shown at the top of
the table using the sim.structural function. In the last matrix, raw correlations are shown below the
diagonal, reliabilities on the diagonal, and disattenuated correlations above the diagonal. Note how the
structural relationships are much clearer when correcting for attenuation. That is, variables loading on
the same factor have dis-attenuated correlations of unity, and the dis-attenuated correlations between
variables loading on different factors reflects the correlations between the factors.

#define the observed variable factor loadings


> fx <- matrix(c(.9,.8,.6,rep(0,7),.6,.8,-.7,rep(0,8),.6,.5,.4),ncol=3)
> colnames(fx) <- paste("F",1:3,sep="")   #factor names (Phi is defined below)
> rownames(fx) <- paste("V",1:8,sep="")
> fx

F1 F2 F3
V1 0.9 0.0 0.0
V2 0.8 0.0 0.0
V3 0.6 0.6 0.0
V4 0.0 0.8 0.0
V5 0.0 -0.7 0.0
V6 0.0 0.0 0.6
V7 0.0 0.0 0.5
V8 0.0 0.0 0.4

#define the structural relationships


> Phi <- matrix(c(1,0,.707,0,1,rep(.707,3),1),ncol=3)
> colnames(Phi) <- rownames(Phi) <- paste("F",1:3,sep="")
> print(Phi,2)

F1 F2 F3
F1 1.0 0.0 0.7
F2 0.0 1.0 0.7
F3 0.7 0.7 1.0

> r <- sim.structural(fx,Phi) #create a correlation matrix with known structure


> print (correct.cor(r$model,r$reliability),2) #correct for reliability

V1 V2 V3 V4 V5 V6 V7 V8
V1 0.81 1.00 0.71 0.00 0.00 0.71 0.71 0.71
V2 0.72 0.64 0.71 0.00 0.00 0.71 0.71 0.71
V3 0.54 0.48 0.72 0.71 -0.71 1.00 1.00 1.00
V4 0.00 0.00 0.48 0.64 -1.00 0.71 0.71 0.71
V5 0.00 0.00 -0.42 -0.56 0.49 -0.71 -0.71 -0.71
V6 0.38 0.34 0.51 0.34 -0.30 0.36 1.00 1.00
V7 0.32 0.28 0.42 0.28 -0.25 0.30 0.25 1.00
V8 0.25 0.23 0.34 0.23 -0.20 0.24 0.20 0.16

However, defining reliability as the correlation between parallel tests still requires finding a
parallel test. But how do we know that two tests are parallel? For just knowing the correlation

between two tests, without knowing the true scores or their variance (and if we did, we
would not bother with reliability), we are faced with three knowns (two variances and one
covariance) but ten unknowns (four variances and six covariances). That is, the observed correlation, $r_{p'_1 p'_2}$, represents the two known variances $s^2_{p'_1}$ and $s^2_{p'_2}$ and their covariance $s_{p'_1 p'_2}$. The model to account for these three knowns reflects the variances of true and error scores for $p'_1$ and $p'_2$ as well as the six covariances between these four terms. In this case of two tests, by defining them to be parallel with uncorrelated errors, the number of unknowns drops to three (for the true score variances of $p'_1$ and $p'_2$ are set equal, as are the error variances, and all covariances with error are set to zero) and the (equal) reliability of each test may be found.
Unfortunately, according to this concept of parallel tests, the possibility of one test being
far better than the other is ignored. Parallel tests need to be parallel by construction or
assumption and the assumption of parallelism may not be tested. With the use of more tests,
however, the number of assumptions can be relaxed (for three tests) and actually tested (for
four or more tests).

7.1.2 Tau equivalent and congeneric tests

With three tests, the number of assumptions may be reduced. If the tests are tau equivalent (individuals differ from each other in their true scores but each person has the same true score on each test) or essentially tau equivalent (tests differ in their true score means but not their true score variance), then each test has the same covariance with true score and the reliability of each of the three tests may be found (Novick and Lewis, 1967). Tau equivalence, or at least essential tau equivalence, is a basic (if unrealized) assumption of internal consistency estimates of reliability (see 7.2.3). Using the notation of Table 7.2, for τ equivalence, $\lambda_1 = \lambda_2 = \lambda_3$, but the $\sigma_i^2$ need not be the same.
With four tests, to find the reliability of each test, we need only assume that the tests all
measure the same construct (to be “congeneric”), although possibly with different true score
saturations (λ1 ...λ4 ) and error score variances (Lord and Novick, 1968). The set of observed
variables and unknown parameters for each of four tests are shown in Table 7.2. When four
variables are all measures (of perhaps different quality) of one common factor, the variables
are said to be congeneric. The parameters may be estimated by exploratory or confirmatory
factor analysis. The reliabilities are the communalities of each variable (the squared factor
loadings) or 1 − the uniqueness for each variable (Table 7.3). With three tests, the parameters can be estimated, but the model is said to be fully saturated, in that there are no extra degrees of freedom (six parameters are estimated from six observed variances and covariances). With four tests, there are two degrees of freedom (eight parameters are estimated from 10 observed variances and covariances).
There are multiple ways of finding the parameters of a set of congeneric tests. Table 7.3
shows the results of an exploratory factor analysis using the factanal function. In Chapter 10
this same model is fit using a structural equation model by the sem package. Given the
loadings ( λi ) on the single latent factor, the reliability of each test is 1 - the uniqueness (or
error variance) for that test.
$$r_{xx} = 1 - u_i^2 = \lambda_i^2 .$$

Table 7.2 Two parallel tests have two variances and one covariance. These allow us to estimate λ1 = λ2 and σ²e1 = σ²e2 and the true score variance. The parameters of τ equivalent tests can be estimated if λ1 = λ2 = λ3. For four congeneric tests, all parameters are free to vary.

Observed variances and covariances:
     V1    V2    V3    V4
V1   s1²
V2   s12   s2²
V3   s13   s23   s3²
V4   s14   s24   s34   s4²

Modeled (congeneric) variances and covariances:
     V1              V2              V3              V4
V1   λ1²σt² + σ²e1
V2   λ1λ2σt²         λ2²σt² + σ²e2
V3   λ1λ3σt²         λ2λ3σt²         λ3²σt² + σ²e3
V4   λ1λ4σt²         λ2λ4σt²         λ3λ4σt²         λ4²σt² + σ²e4

Table 7.3 The congeneric model is a one factor model of the observed covariance or correlation matrix.
The test reliabilities will be 1- the uniquenesses of each test. The correlation matrix was generated from
a factor model with loadings of .9, .8, .7, and .6. Not surprisingly, a factor model correctly estimates
these parameters. If this analysis is redone with sample values for three variables, the one factor model
still fits perfectly (with 0 df), but the one factor model will not necessarily fit perfectly for four variables.
The reliability of each test is then ρ_ii = λ_i² = 1 − u_i². Thus, the reliabilities are .81, .64, .49, and .36 for V1 ... V4 respectively. Although the factanal function reports the uniquenesses (u²), the fa function in the psych package reports h² = 1 − u² as well.

> f <- c(.9,.8,.7,.6)


> r <- sim.structural(f)
> r

Call: sim.structural(fx = f)

$model (Population correlation matrix)


V1 V2 V3 V4
V1 1.00 0.72 0.63 0.54
V2 0.72 1.00 0.56 0.48
V3 0.63 0.56 1.00 0.42
V4 0.54 0.48 0.42 1.00

$reliability (population reliability)


[1] 0.81 0.64 0.49 0.36

> factanal(covmat=r$model,factors=1)

Call:
factanal(factors = 1, covmat = r$model)

Uniquenesses:
V1 V2 V3 V4
0.19 0.36 0.51 0.64

Loadings:
Factor1
V1 0.9
V2 0.8
V3 0.7
V4 0.6

Factor1
SS loadings 2.300
Proportion Var 0.575

The degrees of freedom for the model is 2 and the fit was 0
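The implied reliabilities may be pulled directly from the factanal output as 1 minus the uniquenesses. A minimal sketch continuing the example above:

f1 <- factanal(covmat = r$model, factors = 1)
round(1 - f1$uniquenesses, 2)   #the implied reliabilities: .81 .64 .49 .36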

[Figure 7.3: "Four congeneric tests", a path diagram of one factor (F1) with loadings 0.9, 0.8, 0.7, and 0.6 on V1 through V4.]

Fig. 7.3 Fitting four congeneric measures by one factor. There are four observed variances and six
observed covariances. There are eight unknowns to estimate. This model was fit by a one factor ex-
ploratory factor model, although a one factor confirmatory model would work as well. The confirmatory
solution using the sem package is discussed in Chapter 10. The graph was done using the fa.diagram
function.

7.2 Reliability and internal structure

Unfortunately, with rare exceptions, we normally are faced with just one test, not two, three
or four. How then to estimate the reliability of that one test? Defined as the correlation
between a test and a test just like it, reliability would seem to require a second test. The
traditional solution when faced with just one test is to consider the internal structure of that
test. Letting reliability be the ratio of true score variance to test score variance (Equation 7.1), or alternatively, 1 − the ratio of error variance to test score variance, the problem becomes one of estimating the amount of error variance in the test. There are a number of solutions
to this problem that involve examining the internal structure of the test. These range from
considering the correlation between two random parts of the test to examining the structure
of the items themselves.

7.2.1 Split half reliability

If a test is split into two random halves, then the correlation between these two halves can
be used to estimate the split half reliability of the total test. That is, two tests, X, and a test just like it, X', with covariance, $C_{XX'}$, can be represented as

 
$$\Sigma_{XX'} = \begin{bmatrix} V_x & C_{xx'} \\ C_{xx'} & V_{x'} \end{bmatrix} \qquad (7.10)$$
and letting $V_x = \mathbf{1 V_x 1'}$ and $C_{XX'} = \mathbf{1 C_{XX'} 1'}$, the correlation between the two tests will be
$$\rho = \frac{C_{xx'}}{\sqrt{V_x V_{x'}}} .$$

But the variance of a test is simply the sum of the true covariances and the error variances:
$$V_x = \mathbf{1 V_x 1'} = \mathbf{1 C_t 1'} + \mathbf{1 V_e 1'} = V_t + V_e$$
and the structure of the two tests seen in Equation 7.10 becomes
$$\Sigma_{XX'} = \begin{bmatrix} V_X = V_t + V_e & C_{xx'} = V_t \\ V_t = C_{xx'} & V_{X'} = V_{t'} + V_{e'} \end{bmatrix}$$
and because $V_t = V_{t'}$ and $V_e = V_{e'}$, the correlation between each half (their reliability) is
$$\rho = \frac{C_{XX'}}{V_X} = \frac{V_t}{V_X} = 1 - \frac{V_e}{V_X} .$$
The split half solution estimates reliability based upon the correlation of two random split halves of a test and the implied correlation with another test also made up of two random splits:
$$\Sigma_{XX'} = \begin{bmatrix}
V_{x_1} & C_{x_1 x_2} & C_{x_1 x'_1} & C_{x_1 x'_2} \\
C_{x_1 x_2} & V_{x_2} & C_{x_2 x'_1} & C_{x_2 x'_2} \\
C_{x_1 x'_1} & C_{x_2 x'_1} & V_{x'_1} & C_{x'_1 x'_2} \\
C_{x_1 x'_2} & C_{x_2 x'_2} & C_{x'_1 x'_2} & V_{x'_2}
\end{bmatrix}$$

Because the splits are done at random and the second test is parallel with the first test, the expected covariances between splits are all equal to the true score variance of one split ($V_{t_1}$), and the variance of a split is the sum of true score and error variances:
$$\Sigma_{XX'} = \begin{bmatrix}
V_{t_1} + V_{e_1} & V_{t_1} & V_{t_1} & V_{t_1} \\
V_{t_1} & V_{t_1} + V_{e_1} & V_{t_1} & V_{t_1} \\
V_{t_1} & V_{t_1} & V_{t'_1} + V_{e'_1} & V_{t'_1} \\
V_{t_1} & V_{t_1} & V_{t'_1} & V_{t'_1} + V_{e'_1}
\end{bmatrix}$$

The correlation between a test made up of two halves with intercorrelation ($r_1 = V_{t_1}/V_{x_1}$) with another such test is
$$r_{xx'} = \frac{4V_{t_1}}{\sqrt{(4V_{t_1} + 2V_{e_1})(4V_{t_1} + 2V_{e_1})}} = \frac{4V_{t_1}}{2V_{t_1} + 2V_{x_1}} = \frac{4r_1}{2r_1 + 2}$$
and thus
$$r_{xx'} = \frac{2r_1}{1 + r_1} . \qquad (7.11)$$
This way of estimating the correlation of a test with a parallel test based upon the correlation of two split halves, correcting for the fact that they were half tests rather than full tests (the split half reliability), is a special case (n = 2) of the more general Spearman-Brown correction (Brown, 1910; Spearman, 1910):
$$r_{xx} = \frac{n r_1}{1 + (n-1) r_1} . \qquad (7.12)$$
It is important to remember when finding the split half reliability that the observed correlation between the two halves needs to be adjusted by the Spearman-Brown prophecy formula (Equation 7.12), which for n = 2 is just Equation 7.11.
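As a minimal sketch of the mechanics, consider splitting the five neuroticism items of the bfi data set in psych (used again in Table 7.4) into two arbitrary halves, correlating the half scores, and applying Equation 7.11. Because these halves are of unequal length, the λ4 formula introduced later (Equation 7.25) would be more appropriate, but the example shows the idea:

library(psych)
x1 <- rowSums(bfi[, 16:17], na.rm = TRUE)   #an arbitrary first "half" (2 items)
x2 <- rowSums(bfi[, 18:20], na.rm = TRUE)   #the remaining 3 items
r1 <- cor(x1, x2, use = "pairwise")
2 * r1 / (1 + r1)                           #Spearman-Brown corrected split half (Eq 7.11)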

7.2.2 Domain sampling

Other techniques to estimate the reliability of a single test are based on the domain sampling
model in which tests are seen as being made up of items randomly sampled from a domain
of items. Analogous to the notion of estimating characteristics of a population of people
by taking a sample of people is the idea of sampling items from a universe of items. (Lord (1955) made the distinction between "Type 1" sampling of people, "Type 2" sampling of items, and "Type 12" sampling of persons and items.) Consider a test meant to assess English vocabulary. A person's vocabulary could be defined as the number of words in an unabridged dictionary that he or she recognizes. But since the total set of possible words can exceed 500,000, it is clearly not feasible to ask someone all of these words. Rather, consider a test of k words sampled from the larger domain of n words. What is the correlation of this test with the domain? That is, what is the correlation across subjects of test scores with their domain scores?

7.2.2.1 Correlation of an item with a domain

First consider the correlation of a single (randomly chosen) item with the domain. Let the
domain score for an individual be Di and the score on a particular item, j, be Xi j . For ease
of calculation, convert both of these to deviation scores. di = Di − D̄ and xi j = Xi j − X̄ j . Then
$$r_{x_j d} = \frac{cov_{x_j d}}{\sqrt{\sigma_{x_j}^2 \sigma_d^2}} .$$

Now, because the domain is just the sum of all the items, the domain variance $\sigma_d^2$ is just the sum of all the item variances and all the item covariances:
$$\sigma_d^2 = \sum_{j=1}^{n} \sum_{k=1}^{n} cov_{x_j x_k} = \sum_{j=1}^{n} \sigma_{x_j}^2 + \sum_{j=1}^{n} \sum_{k \ne j} cov_{x_j x_k} .$$

Then letting $\bar{c} = \frac{\sum_{j=1}^{n}\sum_{k \ne j} cov_{x_j x_k}}{n(n-1)}$ be the average covariance and $\bar{v} = \frac{\sum_{j=1}^{n}\sigma_{x_j}^2}{n}$ the average item variance, the correlation of a randomly chosen item with the domain is
$$r_{x_j d} = \frac{\bar{v} + (n-1)\bar{c}}{\sqrt{\bar{v}(n\bar{v} + n(n-1)\bar{c})}} = \frac{\bar{v} + (n-1)\bar{c}}{\sqrt{n\bar{v}(\bar{v} + (n-1)\bar{c})}} .$$

Squaring this to find the squared correlation with the domain and factoring out the common elements leads to
$$r_{x_j d}^2 = \frac{\bar{v} + (n-1)\bar{c}}{n\bar{v}}$$
and then taking the limit as the size of the domain gets large gives
$$\lim_{n \to \infty} r_{x_j d}^2 = \frac{\bar{c}}{\bar{v}} . \qquad (7.13)$$
That is, the squared correlation of an average item with the domain is the ratio of the
average interitem covariance to the average item variance. Compare the correlation of a test
with true score (Eq 7.5) with the correlation of an item to the domain score (Eq 7.13).
Although identical in form, the former makes assumptions about true score and error, the
latter merely describes the domain as a large set of similar items.

7.2.2.2 Correlation of a test with the domain

A similar analysis can be done for a test of length k with a large domain of n items. A k-item
test will have total variance, Vk , equal to the sum of the k item variances and the k(k-1) item
covariances:
$$V_k = \sum_{i=1}^{k} v_i + \sum_{i=1}^{k} \sum_{j \ne i} c_{ij} = k\bar{v} + k(k-1)\bar{c} .$$

The correlation with the domain will be


$$r_{kd} = \frac{cov_{kd}}{\sqrt{V_k V_d}} = \frac{k\bar{v} + k(n-1)\bar{c}}{\sqrt{(k\bar{v} + k(k-1)\bar{c})(n\bar{v} + n(n-1)\bar{c})}} = \frac{k(\bar{v} + (n-1)\bar{c})}{\sqrt{nk(\bar{v} + (k-1)\bar{c})(\bar{v} + (n-1)\bar{c})}}$$

Then the squared correlation of a k item test with the n item domain is
$$r_{kd}^2 = \frac{k(\bar{v} + (n-1)\bar{c})}{n(\bar{v} + (k-1)\bar{c})}$$
and the limit as n gets very large becomes
$$\lim_{n \to \infty} r_{kd}^2 = \frac{k\bar{c}}{\bar{v} + (k-1)\bar{c}} . \qquad (7.14)$$

This is an important conclusion: the squared correlation of a k item test with a very large domain will be a function of the number of items in the test (k) and the average covariance of the items within the test (and, by assumption, the domain). Compare Eq 7.12 to Eq 7.14. The first, the Spearman-Brown prophecy formula, estimates the reliability of an n-part test based upon the average correlation between the n parts. The second, the squared correlation of a test with the domain, estimates the fraction of test variance that is domain variance based upon the average item variance, the average item covariance, and the number of items. For standardized items, $\bar{v} = 1$ and $\bar{c} = \bar{r}$, and the two equations are identical.
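The domain sampling result can be checked by simulation. In the following sketch the domain size, the common loading, and the sample size are all arbitrary choices; a k item test score is correlated with the total score on a large domain of similar items and compared with Equation 7.14:

set.seed(42)
n.subjects <- 5000
n.domain <- 400                 #a "large" domain of items
k <- 10                         #the length of the sampled test
lambda <- 0.4                   #a common loading, so for standardized items rbar = .16
theta <- rnorm(n.subjects)
items <- lambda * matrix(theta, n.subjects, n.domain) +
  sqrt(1 - lambda^2) * matrix(rnorm(n.subjects * n.domain), n.subjects, n.domain)
domain.score <- rowSums(items)
test.score <- rowSums(items[, 1:k])
rbar <- lambda^2
c(observed = cor(test.score, domain.score)^2,
  predicted = k * rbar / (1 + (k - 1) * rbar))   #Eq 7.14 with vbar = 1, cbar = rbar

Because the simulated domain is finite (400 items) and the test items are themselves part of it, the observed value will be slightly larger than the limiting value given by Equation 7.14, but the two should agree closely.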

7.2.3 The internal structure of a test. Part 1: coefficient α

Although defined in terms of the correlation of a test with a test just like it, reliability can
be estimated by the characteristics of the items within the test. The desire for an easy to use
“magic bullet” based upon the domain sampling model has led to a number of solutions for
estimating the reliability of a test based upon characteristics of the covariances of the items.
All of these estimates are based upon classical test theory and assume that the covariances
between items represents true covariance, but that the variances of the items reflect an
unknown sum of true and unique variance. From the variance of a composite (Eq 5.1), it is known that the variance of a total test, $\sigma_x^2$, made up of a sum of individual items, $x_i$, is
$$\sigma_x^2 = \sum_{i \ne j} \sigma_{x_i x_j} + \sum \sigma_{x_i}^2 . \qquad (7.15)$$

After earlier work introduced various shortcuts (Kuder and Richardson, 1937) that did not require finding the covariances, Guttman (1945), in an attempt to formalize the estimation of reliability, proposed six lower bounds for reliability, $\rho_{xx}$, that took advantage of the internal structure of the test:
$$\rho_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = \frac{\sigma_x^2 - \sigma_e^2}{\sigma_x^2} .$$
Each one successively modifies the way that the error variance of the items is estimated. Unfortunately, although many psychometricians deplore its use, one of these estimates, λ3 (Guttman, 1945), also known as coefficient alpha (Cronbach, 1951), is by far and away the most common estimate of reliability. The appeal of α is that it is easy to compute, easy to understand, and available in all statistical programs (including the psych package in R). To understand the appeal of α, as well as the reasons not to rely solely on it, it is necessary to consider a number of alternatives.

Although splitting a test into two and calculating the reliability based upon the correlation between the two halves corrected by the Spearman-Brown formula was one way of estimating reliability, the result would depend upon the particular split. When considering the number of possible split halves of a k-item test, $\frac{k!}{2\left(\frac{k}{2}!\right)^2}$, Kuder and Richardson (1937) introduced a short cut formula for reliability in terms of the total test variance, $\sigma_x^2$, and the average item variance, $\overline{pq}$, of a true-false test where $p_i$ and $q_i$ represent the percentage passing and failing any one item. The Kuder-Richardson 20 formula could be calculated without finding the average covariance or correlation required by Eq 7.12 and Eq 7.14 by taking advantage of the identity of Eq 7.15:
$$r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = \frac{k}{k-1}\,\frac{\sigma_x^2 - k\overline{pq}}{\sigma_x^2} . \qquad (7.16)$$
Functionally, this is finding the total true score variance of a k item test as k² times the average covariance, $\sigma_t^2 = k^2\bar{c}$, and recognizing that the total test variance represents the sum of the item covariances plus the sum of the item error variances, $\sigma_x^2 = k^2\bar{c} + k\bar{\sigma}_e^2$. Taking the total test variance, subtracting the sum of the item variances (k times the average variance) and dividing by k(k − 1) (the number of covariance terms) gives an average covariance. Multiplying this by k² gives the total true score variance, which when divided by the total test variance is the test reliability. The clear advantage of KR20 was that it could be calculated without finding the inter-item covariances or correlations, but just the total test variance and the average item variance.
An even easier shortcut was to estimate the average variance by finding the variance of the average item (the 21st formula of Kuder and Richardson, 1937), now known as KR21. That is, by finding the percent passing the average item, $\bar{p}$, it is possible to find the variance of the average item, $\bar{p}\bar{q}$, which will be a positively biased estimate of the average item variance and thus a negatively biased estimate of reliability. (Unless all items have equal probabilities of success, the variance of the average item will be greater than the average of the variances of the items.)
$$r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = \frac{k}{k-1}\,\frac{\sigma_x^2 - k\bar{p}\bar{q}}{\sigma_x^2} .$$
Coefficient alpha (Cronbach, 1951) is a straightforward generalization of KR20 to the case of non-dichotomous as well as dichotomous items. Letting $\sigma_i^2$ represent the variance of item i, and $\sigma_x^2$ the variance of the total test, then the average covariance of an item with any other item is
$$\bar{c} = \frac{\sigma_x^2 - \sum \sigma_i^2}{k(k-1)}$$
and thus the ratio of the total covariance in the test to the total variance in the test is
$$\alpha = r_{xx} = \frac{\sigma_t^2}{\sigma_x^2} = \frac{k^2\,\frac{\sigma_x^2 - \sum \sigma_i^2}{k(k-1)}}{\sigma_x^2} = \frac{k}{k-1}\,\frac{\sigma_x^2 - \sum \sigma_i^2}{\sigma_x^2} \qquad (7.17)$$
which is just KR20, but using the sum of the item variances rather than k times the average item variance, and thus allowing for non-dichotomous items.
An alternate way of finding coefficient alpha, based upon finding the average covariance between items, is to consider the ratio of the total covariance of the test to the total variance
$$\alpha = \frac{k^2\bar{c}}{k\bar{v} + k(k-1)\bar{c}} = \frac{k\bar{c}}{\bar{v} + (k-1)\bar{c}} \qquad (7.18)$$
which for standardized items is just
$$\alpha = \frac{k\bar{r}}{1 + (k-1)\bar{r}} . \qquad (7.19)$$

In the preceding few pages, six important equations have been introduced. All six of these equations reflect characteristics of the reliability of a test and how observed scores relate to true score. Reliability, defined as the correlation between two parallel forms of a test, is the same as the squared correlation of either test with true score and is the amount of true score variance in the test. Reliability is an increasing function of the correlation between random split halves of either test. Coefficient α, derived from Eq 7.19, is the same as the reliability estimated by the Spearman-Brown prophecy formula (Eq 7.12), but is derived from domain sampling principles, and, as is seen in Eq 7.18, is the same as the squared correlation of a k-item test with the domain (Eq 7.14). As seen in Eq 7.17, coefficient α is the same as the reliability of a test found for dichotomous items using formula KR20, Eq 7.16. That is, all six of these equations, although derived in different ways by different people, have identical meanings.
As an example of finding coefficient α, consider the five neuroticism items from the bfi data set. This data set contains 25 items organized in five sets of five items to measure each of the so-called "Big Five" dimensions of personality (Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness). All five of these items are scored in the same direction and can be analyzed without reverse keying any particular item. The analysis in Table 7.4 reports the values of α based upon the covariances (raw) as well as the correlations (standardized). In addition, Guttman's coefficient λ6 (discussed below), an estimate of reliability based upon the squared multiple correlation (smc) of each item with the remaining items, as well as the average intercorrelation of the items, are reported. The correlation of each item with the total scale is reported in two forms: the first is just the raw correlation, which reflects item overlap; the second corrects for item overlap by replacing each item's correlation with itself (1.0) with the estimated item reliability based upon the smc.
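The same value may also be computed directly from Equation 7.17 without any special function, which makes clear that α depends only on the item variances and the total test variance. A minimal sketch (compare with the raw α of .81 reported in Table 7.4):

library(psych)
C <- cov(bfi[, 16:20], use = "pairwise")   #covariance matrix of the 5 neuroticism items
k <- ncol(C)
(k / (k - 1)) * (sum(C) - sum(diag(C))) / sum(C)   #Eq 7.17: approximately .81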

7.2.4 The internal structure of a test. Part 2: Guttman's lower bounds of reliability

Although arguing that reliability was only meaningful in the case of test-retest, Guttman (1945) may be credited with introducing a series of lower bounds for reliability, λ1 ... λ6, each based upon the item characteristics of a single test. These six have formed the base for most of the subsequent estimates of reliability based upon the item characteristics of a single test. Of the six, λ3 is the most well known and was called coefficient alpha or α by Cronbach (1951). All of these measures decompose the total test variance, $V_x$, into two parts: that associated with error, $V_e$, and whatever is left over, $V_x - V_e$. (Although not using the term true score, this is implied.) Reliability is then just
$$r_{xx} = \frac{V_x - V_e}{V_x} = 1 - \frac{V_e}{V_x} . \qquad (7.20)$$

Table 7.4 Coefficient α may be found using the alpha function in psych. The analysis is done for 5
neuroticism items taken from the bfi data set.

> alpha(bfi[16:20])

Reliability analysis
Call: alpha(x = bfi[16:20])

raw_alpha std.alpha G6(smc) average_r mean sd


0.81 0.81 0.8 0.46 15 5.8

Reliability if an item is dropped:


raw_alpha std.alpha G6(smc) average_r
N1 0.75 0.75 0.70 0.42
N2 0.76 0.76 0.71 0.44
N3 0.75 0.76 0.74 0.44
N4 0.79 0.79 0.76 0.48
N5 0.81 0.81 0.79 0.51

Item statistics
n r r.cor mean sd
N1 990 0.81 0.78 2.8 1.5
N2 990 0.79 0.75 3.5 1.5
N3 997 0.79 0.72 3.2 1.5
N4 996 0.71 0.60 3.1 1.5
N5 992 0.67 0.52 2.9 1.6

The problem then becomes how to estimate the error variance, Ve .


Consider the case of a test made up of 12 items, all of which share 20% of their variance
with a general factor, but form three subgroups of (6, 4 and 2) items which share 30%, 40%
or 50% of their respective variance with some independent group factors. The remaining
item variance is specific or unique variance (Table 7.5, Figure 7.4). An example of this kind
of a test might be a measure of Neuroticism, with a broad general factor measuring general
distress, group factors representing anger, depression, and anxiety, and specific item variance.
For standardized items this means that the correlations between items in different groups
are .2, those within groups are .5, .6 or .7 for groups 1, 2 and 3 respectively. The total test
variance is thus
$$V_t = 12^2 * .2 + 6^2 * .3 + 4^2 * .4 + 2^2 * .5 + 6 * .5 + 4 * .4 + 2 * .3 = 53.2$$
and the error variance $V_e$ is
$$V_e = 6 * .5 + 4 * .4 + 2 * .3 = 5.2$$
for a reliability of
$$r_{xx} = 1 - \frac{5.2}{53.2} = .90 .$$
Using the data matrix formed in Table 7.5 and shown in Figure 7.4, we can see how various
estimates of reliability perform.
The first of Guttman's lower bounds, λ1, considers that all of an item's variance is error and that only the interitem covariances reflect true variability. Thus, λ1 subtracts the sum of the

Table 7.5 A hypothetical 12 item test can be thought of as the sum of the general variances of all
items, group variances for some items, and specific variance for each item. (See Figure 7.4). An example
of this kind of a test might be a measure of Neuroticism, with a broad general factor, group factors
representing anger, depression, and anxiety, and specific item variance.

> general <- matrix(.2,12,12)


> group <- super.matrix( super.matrix(matrix(.3,6,6),matrix(.4,4,4)),matrix(.5,2,2))
> error <- diag(c(rep(.5,6),rep(.4,4),rep(.3,2)),12,12)
> Test <- general + group + error
> colnames(Test ) <- rownames(Test) <- paste("V",1:12,sep="")
> round(Test,2)

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12


V1 1.0 0.5 0.5 0.5 0.5 0.5 0.2 0.2 0.2 0.2 0.2 0.2
V2 0.5 1.0 0.5 0.5 0.5 0.5 0.2 0.2 0.2 0.2 0.2 0.2
V3 0.5 0.5 1.0 0.5 0.5 0.5 0.2 0.2 0.2 0.2 0.2 0.2
V4 0.5 0.5 0.5 1.0 0.5 0.5 0.2 0.2 0.2 0.2 0.2 0.2
V5 0.5 0.5 0.5 0.5 1.0 0.5 0.2 0.2 0.2 0.2 0.2 0.2
V6 0.5 0.5 0.5 0.5 0.5 1.0 0.2 0.2 0.2 0.2 0.2 0.2
V7 0.2 0.2 0.2 0.2 0.2 0.2 1.0 0.6 0.6 0.6 0.2 0.2
V8 0.2 0.2 0.2 0.2 0.2 0.2 0.6 1.0 0.6 0.6 0.2 0.2
V9 0.2 0.2 0.2 0.2 0.2 0.2 0.6 0.6 1.0 0.6 0.2 0.2
V10 0.2 0.2 0.2 0.2 0.2 0.2 0.6 0.6 0.6 1.0 0.2 0.2
V11 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 1.0 0.7
V12 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.7 1.0

> sum(Test)
[1] 53.2
> sum(general)
[1] 28.8
> sum(group)
[1] 19.2
> sum(error)
[1] 5.2

diagonal of the observed item covariance matrix from the total test variance:

$$\lambda_1 = 1 - \frac{tr(\mathbf{V_x})}{V_x} = \frac{V_x - tr(\mathbf{V_x})}{V_x} . \qquad (7.21)$$
This leads to an estimate of
$$\lambda_1 = 1 - \frac{12}{53.2} = \frac{41.2}{53.2} = .774 .$$
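In R this is a one line computation on the Test matrix of Table 7.5 (a minimal sketch):

lambda1 <- (sum(Test) - sum(diag(Test))) / sum(Test)
lambda1   #.774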
The second bound, λ2, replaces the diagonal with a function of the square root of the sums of squares of the off diagonal elements. Let $C_2 = \mathbf{1 (V - diag(V))^2 1'}$, then
$$\lambda_2 = \lambda_1 + \frac{\sqrt{\frac{n}{n-1} C_2}}{V_x} = \frac{V_x - tr(\mathbf{V_x}) + \sqrt{\frac{n}{n-1} C_2}}{V_x} . \qquad (7.22)$$

[Figure 7.4: four correlation matrix panels: Total = g + Gr + E (σ² = 53.2), General = .2 (σ² = 28.8), 3 groups = .3, .4, .5 (σ² = 19.2, with blocks of 10.8, 6.4, and 2), and Item Error (σ² = 5.2).]

Fig. 7.4 The total variance of a test may be thought of as composed of a general factor, several group
factors, and specific item variance. The problem in estimating total reliability is to determine how much
of the total variance is due to specific or unique item variance. Shading represent the magnitude of
the correlation. All items are composed of 20% general factor variance, the first six items have a group
factor accounting for 30% of their variance, the next four items have a stronger group factor accounting
for 40% of their variance, and the last two items define a very powerful group factor, accounting for
50% of their variance. The items are standardized, and thus have total variance of 1.

Effectively, this is replacing the diagonal with n times the square root of the average squared off diagonal element:
$$\lambda_2 = .774 + \frac{\sqrt{\frac{12}{11}\, 16.32}}{53.2} = .85$$
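Continuing the sketch, λ2 may be computed from the sum of the squared off diagonal elements, using lambda1 from the previous snippet:

n <- ncol(Test)
C2 <- sum((Test - diag(diag(Test)))^2)        #sum of squared off diagonal elements (16.32)
lambda1 + sqrt(n / (n - 1) * C2) / sum(Test)  #about .85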
Guttman's 3rd lower bound, λ3, also modifies λ1 and estimates the true variance of each item as the average covariance between items and is, of course, the same as Cronbach's α:
$$\lambda_3 = \lambda_1 + \frac{n\,\frac{V_X - tr(\mathbf{V_X})}{n(n-1)}}{V_X} = \frac{n\lambda_1}{n-1} = \frac{n}{n-1}\Bigl(1 - \frac{tr(\mathbf{V_x})}{V_x}\Bigr) = \frac{n}{n-1}\,\frac{V_x - tr(\mathbf{V_x})}{V_x} = \alpha \qquad (7.23)$$

This is just replacing the diagonal elements with the average off diagonal elements. λ2 ≥ λ3
with λ2 > λ3 if the covariances are not identical.

$$\lambda_3 = \alpha = \frac{12}{11}\Bigl(1 - \frac{12}{53.2}\Bigr) = \frac{12}{11}\,\frac{(53.2 - 12)}{53.2} = .84$$
As pointed out by Ten Berge and Zegers (1978), λ3 and λ2 are both corrections to λ1 and
this correction may be generalized as an infinite set of successive improvements.
$$\mu_r = \frac{1}{V_x}\Bigl(p_0 + \bigl(p_1 + \bigl(p_2 + \dots \bigl(p_{r-1} + (p_r)^{1/2}\bigr)^{1/2} \dots \bigr)^{1/2}\bigr)^{1/2}\Bigr), \quad r = 0, 1, 2, \dots \qquad (7.24)$$
where
$$p_h = \sum_{i \ne j} \sigma_{ij}^{2^h}, \quad h = 0, 1, 2, \dots, r-1$$
and
$$p_h = \frac{n}{n-1} \sum_{i \ne j} \sigma_{ij}^{2^h}, \quad h = r .$$
Clearly µ0 = λ3 = α and µ1 = λ2 . µr ≥ µr−1 ≥ . . . µ1 ≥ µ0 , although the series does not improve
much after the first two steps (Ten Berge and Zegers, 1978).
Guttman's fourth lower bound, λ4, was originally proposed as any split half reliability (Guttman, 1945) but has been interpreted as the greatest split half reliability (Jackson and Agunwamba, 1977). If X is split into two parts, $X_a$ and $X_b$, with covariance $c_{ab}$, then
$$\lambda_4 = 2\Bigl(1 - \frac{V_{X_a} + V_{X_b}}{V_X}\Bigr) = \frac{4c_{ab}}{V_x} = \frac{4c_{ab}}{V_{X_a} + V_{X_b} + 2c_{ab}} . \qquad (7.25)$$

If the two parts are standardized, the covariance becomes a correlation, the two variances are equal, and λ4 is just the normal split half reliability, but in this case, of the most similar splits. In the case of the example, there are several ways that lead to a "best split", but any scale made up of 3 items from the first six, two from the second four and one from the last two will correlate .82 with the corresponding other scale. Correcting this for test length using the Spearman-Brown correction leads to
$$\lambda_4 = \frac{2 * .82}{1 + .82} = .90 .$$
In the general case of splits with unequal variances, it is better to use Equation 7.25 rather
than 7.12.
λ5 , Guttman’s fifth lower bound, replaces the diagonal values with twice the square root
of the maximum (across items) of the sums of squared interitem covariances

$$\lambda_5 = \lambda_1 + \frac{2\sqrt{\bar{C}_2}}{V_X} . \qquad (7.26)$$
Although superior to λ1, λ5 underestimates the correction to the diagonal. A better estimate would be analogous to the correction used in λ3:
$$\lambda_{5+} = \lambda_1 + \frac{n}{n-1}\,\frac{2\sqrt{\bar{C}_2}}{V_X} . \qquad (7.27)$$

Guttman's final bound considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is
$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x} . \qquad (7.28)$$
Although the smc used in finding Guttman's λ6 is normally found by using just the other items in the particular scale, in a multiple scale inventory the concept can be generalized to consider the smc based upon all the other items. In the psych package this is labeled as λ6+. Using the smc function to find the smc for each item and then summing them across all items,
$$\lambda_6 = 1 - \frac{6.51}{53.2} = .878 .$$
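Using the smc function from psych, this is a one line sketch:

1 - sum(1 - smc(Test)) / sum(Test)   #lambda 6, about .88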
Yet another estimate has been proposed for the reliability of a principal component (Ten Berge and Hofstee, 1999); it unfortunately also uses λ1 as a symbol, but this time as the magnitude of the eigenvalue of the first principal component:
$$\alpha_{pc} = \frac{n}{n-1}\Bigl(1 - \frac{1}{\lambda_1}\Bigr) . \qquad (7.29)$$
$$\alpha_{pc} = \frac{12}{11}\Bigl(1 - \frac{1}{4.48}\Bigr) = .847 .$$
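Because the first eigenvalue of the Test matrix is about 4.48, this may be verified directly (a minimal sketch):

n <- ncol(Test)
ev1 <- eigen(Test)$values[1]     #largest eigenvalue, about 4.48
(n / (n - 1)) * (1 - 1 / ev1)    #alpha of the first principal component, about .85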
The discussion of various lower bounds of reliability seemed finished when Jackson and
Agunwamba (1977) and Bentler and Woodward (1980) introduced their “greatest lower
bound”, or glb. Woodhouse and Jackson (1977) organized Guttman’s six bounds into a series
of partial orders, and provided an algorithm for estimating the glb of Jackson and Agun-
wamba (1977). An alternative algorithm was proposed by Bentler and Woodward (1980) and
discussed by Sijtsma (2008). Unfortunately, none of these authors considered ωt (see below),
which tends to exceed the glbs reported in the various discussions of the utility of the glb
(Revelle and Zinbarg, 2009).
The Guttman statistics as well as those discussed by Ten Berge and Zegers (1978) may be found using the guttman function in psych.
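For example (a minimal sketch; the exact layout of the output depends upon the version of the psych package):

library(psych)
guttman(Test)   #reports lambda 1 ... lambda 6 and related estimates for the Test matrix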

7.2.5 The internal structure of a test. Part 3: coefficients α, β, ωh and ωt

Two additional coefficients, ωh and ωt , that were not considered by either Guttman (1945) or
Cronbach (1951) were introduced by McDonald (1978, 1999). These two coefficients require
factor analysis to estimate, but are particularly useful measures of the structure of a test.
McDonald’s ωt is similar to Guttman’s λ6 , but uses the estimates of uniqueness (u2j ) for each
item from factor analysis to find e2j . This is based on a decomposition of the variance of a test
score, Vx , into four parts: that due to a general factor, g, that due to a set of group factors,
f, (factors common to some but not all of the items), specific factors, s unique to each item,
and e, random error. (Because specific variance can not be distinguished from random error
unless the test is given at least twice, McDonald (1999) combines these both into error).

Letting
$$x = cg + Af + Ds + e \qquad (7.30)$$
then the communality of item j, based upon general as well as group factors,
$$h_j^2 = c_j^2 + \sum f_{ij}^2 \qquad (7.31)$$
and the unique variance for the item
$$u_j^2 = \sigma_j^2 (1 - h_j^2) \qquad (7.32)$$
may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and
$$\omega_t = \frac{\mathbf{1cc'1' + 1AA'1'}}{V_x} = 1 - \frac{\sum (1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x} . \qquad (7.33)$$
Because $h_j^2 \ge r_{smc}^2$, $\omega_t \ge \lambda_6$. For the example data set, the uniquenesses may be found by factor analysis (Table 7.6) and their sum is 5.2 (compare with Figure 7.4). Thus,
$$\omega_t = 1 - \frac{5.2}{53.2} = .90 .$$
McDonald introduced another reliability coefficient, also labeled ω, based upon the saturation of the general factor. It is important to distinguish here between the two ω coefficients of McDonald (1978) and (McDonald, 1999, Equation 6.20a), ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor, g. For a correlation matrix R with general factor g with loadings c,
$$\omega_h = \frac{\mathbf{1cc'1'}}{V_x} = \frac{(\sum \Lambda_i)^2}{\sum\sum R_{ij}} . \qquad (7.34)$$
That is, ωh is the ratio of the sum of correlations reproduced by the general factor to the sum of all correlations. It is the percentage of a correlation matrix associated with the general factor. For the example,
$$\omega_h = \frac{5.366563^2}{53.2} = \frac{28.8}{53.2} = .54 .$$
As is true for the other estimates of reliability, because the variance associated with the uniquenesses of each item becomes a smaller and smaller fraction of the test as the test becomes longer, ωt will increase as a function of the number of variables and tend asymptotically towards 1.0. However, ωh will not, and rather will tend towards a limit of
$$\omega_{h\infty} = \frac{\mathbf{1cc'1'}}{\mathbf{1cc'1' + 1AA'1'}} . \qquad (7.35)$$
ωh is particularly important when evaluating the importance and reliability of the general factor of a test, while ωt is an estimate of the total reliable variance in a test. As was discussed earlier (6.3.4), measures of cognitive ability have long been analyzed in terms of lower order factors (group factors) as well as a higher order, general factor (e.g., Horn and Cattell, 1966, 1982). More recently, this approach has also been applied to the measurement of personality

Table 7.6 The omega function does a factor analysis followed by an oblique rotation and extraction of a general factor using the Schmid-Leiman transformation (Schmid and Leiman, 1957). The sum of the uniquenesses is used to find ωt and the squared sum of the g loadings to find ωh.

> omega(Test)

Omega
Call: omega(m = Test)
Alpha: 0.84
G.6: 0.88
Omega Hierarchical: 0.54
Omega H asymptotic: 0.6
Omega Total 0.9

Schmid Leiman Factor loadings greater than 0.2


g F1* F2* F3* h2 u2
V1 0.45 0.55 0.5 0.5
V2 0.45 0.55 0.5 0.5
V3 0.45 0.55 0.5 0.5
V4 0.45 0.55 0.5 0.5
V5 0.45 0.55 0.5 0.5
V6 0.45 0.55 0.5 0.5
V7 0.45 0.63 0.6 0.4
V8 0.45 0.63 0.6 0.4
V9 0.45 0.63 0.6 0.4
V10 0.45 0.63 0.6 0.4
V11 0.45 0.71 0.7 0.3
V12 0.45 0.71 0.7 0.3

With eigenvalues of:


g F1* F2* F3*
2.4 1.8 1.6 1.0

general/max 1.33 max/min = 1.8


The degrees of freedom for the model is 33 and the fit was 0

Measures of factor score adequacy


g F1* F2* F3*
Correlation of scores with factors 0.74 0.77 0.81 0.81
Multiple R square of scores with factors 0.55 0.60 0.66 0.66
Minimum correlation of factor score estimates 0.10 0.20 0.31 0.33

traits such as anxiety, which show lower order factors as well as a higher order one (Chen et al., 2006; Zinbarg and Barlow, 1996; Zinbarg et al., 1997). For tests that are thought to have a higher order structure, measures based upon just the average interitem correlation, α or λ6, are not appropriate. Coefficients that reflect the structure, such as ωh and ωt, are more appropriate. If a test is composed of relatively homogeneous items then α and λ6 will provide very similar estimates to ωh and ωt. ωh, ωh∞, ωt, α and λ6 may all be found using the omega function (Table 7.6).
ωh is an estimate of the general factor saturation of a test based upon a factor analytic model. An alternative estimate, coefficient β (Revelle, 1979), uses hierarchical cluster analysis to find the two most unrelated split halves of the test and then uses the implied inter-group item correlation to estimate the total variance accounted for by a general factor. This is based

Table 7.7 Four data sets with equal α reliability estimates but drastically different structures. Each
data set is assumed to represent two correlated clusters. The between cluster correlations are .45, .32,
.14, and 0. For all four data sets, α = .72. Because the within cluster factor loadings are identical within
each set, the two estimates of general factor saturation, β and ωh , are equal. They are .72, .48, .25, and
0 for sets S1, S2, S3 and S4 respectively. Figure 7.5 displays these matrices graphically using cor.plot.

S1 S2
V1 V2 V3 V4 V5 V6 V1 V2 V3 V4 V5 V6
V1 1.0 0.3 0.3 0.3 0.3 0.3 1.00 0.45 0.45 0.20 0.20 0.20
V2 0.3 1.0 0.3 0.3 0.3 0.3 0.45 1.00 0.45 0.20 0.20 0.20
V3 0.3 0.3 1.0 0.3 0.3 0.3 0.45 0.45 1.00 0.20 0.20 0.20
V4 0.3 0.3 0.3 1.0 0.3 0.3 0.20 0.20 0.20 1.00 0.45 0.45
V5 0.3 0.3 0.3 0.3 1.0 0.3 0.20 0.20 0.20 0.45 1.00 0.45
V6 0.3 0.3 0.3 0.3 0.3 1.0 0.20 0.20 0.20 0.45 0.45 1.00

S3 S4
V1 V2 V3 V4 V5 V6 V1 V2 V3 V4 V5 V6
V1 1.0 0.6 0.6 0.1 0.1 0.1 1.00 0.75 0.75 0.00 0.00 0.00
V2 0.6 1.0 0.6 0.1 0.1 0.1 0.75 1.00 0.75 0.00 0.00 0.00
V3 0.6 0.6 1.0 0.1 0.1 0.1 0.75 0.75 1.00 0.00 0.00 0.00
V4 0.1 0.1 0.1 1.0 0.6 0.6 0.00 0.00 0.00 1.00 0.75 0.75
V5 0.1 0.1 0.1 0.6 1.0 0.6 0.00 0.00 0.00 0.75 1.00 0.75
V6 0.1 0.1 0.1 0.6 0.6 1.0 0.00 0.00 0.00 0.75 0.75 1.00

upon the observation that the correlation between the two worst splits reflects the covariances of items that have nothing in common other than what is common to all the items in the test. For the example data set, the two most unrelated parts are formed from the first 10 items and the last two items. The correlation between these two splits is .3355, which implies an average correlation between the items of the two halves of .20. These correlations reflect the general factor saturation, and when corrected for test length they imply that the general factor variance of this 12 item test is 144 * .2 = 28.8. The total test variance is 53.2 and thus
$$\beta = \frac{12 * 12 * .2}{53.2} = .54 .$$
Although in the case of equal correlations within groups and identical correlations between
groups, ωh and β are identical, this is not the case for group factors with unequal general
factor loadings. Whether β or ωh will be greater depends upon the specific pattern of the
general factor loadings (Zinbarg et al., 2005).
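β may be found with the hierarchical clustering function in the psych package. A minimal sketch, assuming the ICLUST result returns its β estimate in a beta element of its output:

library(psych)
ic <- ICLUST(Test)   #hierarchical cluster analysis of the simulated Test matrix
ic$beta              #the worst split half estimate, beta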

7.3 A comparison of internal consistency estimates of reliability

If there are so many different measures of reliability, the question to ask is which reliability
estimate should be used, and why. Consider the four example data sets in Table 7.7 (shown
graphically in Figure 7.5). All four of these data sets (S1 . . . S4) have equal average correlations
(.3) and thus identical values of coefficient α (.72). However, by looking at the correlations,
it is clear that the items in S1 represent a single construct, with all items having equal

[Figure 7.5: four correlation matrix panels: S1: no group factors; S2: large g, small group factors; S3: small g, large group factors; S4: no g but large group factors.]

Fig. 7.5 The correlation matrices of Table 7.7 may be represented graphically using the cor.plot function. Although all four matrices have equal α reliabilities (.72), they differ drastically in the saturation of a general factor (ωh = β = .72, .48, .25 and .00).

correlations of .3. The items in set S4, on the other hand, form two distinct groups, even though the average intercorrelation of all the items remains .3 and α remains .72. Some of the other estimates of reliability discussed above do vary from sets S1 to S4 (Table 7.8). In particular, the two ways of estimating the general factor saturation, ωh and β, go from .72 when the test is unifactorial (S1) to 0 when it contains two independent factors (S4). ωt, in contrast, increases from .72 in the case of a unifactorial test (S1) to .90 in the case of a test containing two independent factors (S4).
To understand how Guttman's bounds relate to each other and to ωh, ωt, and β, it is useful to consider the "Test" data from Table 7.5, the four sample correlation matrices from Table 7.7, and three demonstrations of a higher order structure, one simulated and two from the bifactor data set in the psych package (see Table 7.8). The simulated data set (S.9) was created using the sim.hierarchical function to demonstrate a hierarchical factor model as discussed by Jensen and Weng (1994) and shown earlier (see Table 6.13). The first real data set is the Thurstone example from McDonald (1999) of 9 cognitive variables used to show a clear bifactor (hierarchical) structure. The second example of a bifactor structure

[Figure 7.6: structural diagrams for four data sets: 3 simulated factors, 3 cognitive factors, 5 health factors, and 2 sets of personality items.]

Fig. 7.6 One simulated and three real data sets. The first is an example of hierarchical structure created by the sim.hierarchical function based upon an article by Jensen and Weng (1994). The second and third data sets, Thurstone and Reise, are from the bifactor data set: the Thurstone set consists of nine cognitive variables adapted by Bechtoldt (1961) from Thurstone and Thurstone (1941), and the Reise set consists of 14 health related items. The last is a set of 10 items thought to measure two different traits (Agreeableness and Conscientiousness) taken from the bfi data set.

are 14 health related items from Reise et al. (2007). The last real example, the BFI, uses 10 items from the bfi data set (examples of the "Big Five Inventory") and is an example of two distinct constructs incorrectly combined into one scale. The 10 items from the BFI represent five items measuring "Agreeableness" and five measuring "Conscientiousness". Normally seen as separate traits, they are included as an example of how a large α is not a sign of a single dimension.
Comparing the S1 ... S4 data sets, α is equal to or exceeds all of the other Guttman bounds when there is exactly one factor in the data (S1). But as the general factor becomes less important, and the group factors more important (S2 ... S4), α does not change, but the other Guttman coefficients do. λ6, based upon the smc as an estimate of item reliability, underestimates the reliability for the completely homogeneous set, but exceeds α as the test becomes more multifaceted.

Table 7.8 Comparison of 13 estimates of reliability. Test is the data set from Table 7.5. The next four
data sets are S1-S4 from Table 7.7 and Figure 7.5. S.9 is a simulated hierarchical structure using the
sim.hierarchical function based upon Jensen and Weng (1994), T.9 is the 9 cognitive variables used
by McDonald (1999), R14 is 14 health related items from Reise et al. (2007), the BFI is an example of
two distinct constructs “incorrectly” combined into one scale. (See Figure 7.6). λ1 . . . λ6 are the Guttman
(1945) bounds found by the guttman function as are µ0 . . . µ3 from Ten Berge and Zegers (1978), ωh and
ωt are from McDonald (1999) and found by the omega function, β is from Revelle (1979) and found by
the ICLUST function. Because β and ωh reflect the general factor saturation they vary across the S-1 ...
S-4 data sets and are much lower than α or ωt for incorrectly specified scales such as the BFI.

Estimate Test S-1 S-2 S-3 S-4 S.9 T.9 R14 BFI
β (min) .54 .72 .48 .25 .00 .57 .76 .79 .40
ωh .54 .72 .48 .25 .00 .69 .74 .78 .36
ωh∞ .60 1.00 .62 .29 .00 .86 .79 .85 .47
λ1 .77 .60 .60 .60 .60 .68 .79 .85 .65
λ3 (α, µ0 ) .84 .72 .72 .72 .72 .76 .89 .91 .72
α pc .85 .72 .72 .72 .72 .77 .89 .91 .73
λ2 (µ1 ) .85 .72 .73 .75 .79 .77 .89 .91 .74
µ2 .86 .72 .73 .76 .80 .77 .90 .91 .74
µ3 .86 .72 .73 .76 .80 .77 .90 .91 .74
λ5 .82 .69 .70 .72 .74 .75 .87 .89 .71
λ6 (smc) .88 .68 .72 .78 .86 .76 .91 .92 .75
λ4 (max) .90 .72 .76 .83 .89 .76 .93 .93 .82
glb .90 .72 .76 .83 .89 .76 .93 .93 .82
ωt .90 .72 .78 .84 .90 .86 .93 .92 .77

In that reliability is used to correct for attenuation (Equation 7.3), underestimating the reliability will lead to an overestimate of the unattenuated correlation, and overestimating the reliability will lead to an underestimate of the unattenuated correlation. Choosing the proper reliability coefficient is therefore very important and should be guided by careful thought and strong theory. In the case in which our test is multidimensional and several of the dimensions contribute to the prediction of the criterion of interest, α will underestimate the reliability and thus lead to an overcorrection, but unfortunately, so will most of the estimates. ωt will lead to a more accurate correction. In the case in which the test is multidimensional but only the test's general factor contributes to the prediction of the criterion of interest, α will overestimate the reliability associated with the general factor and lead to an undercorrection. ωh would lead to a more accurate correction in this case.

7.4 Estimation of reliability

As initially introduced by Spearman, reliability was used to correct for the attenuation of
relationships due to error in measurement. The initial concept of reliability, rxx , was the
correlation with a parallel test. This correlation allowed for an estimate of the percent of
error variance in the test. Congeneric test theory elaborated this concept such that test
reliability was the test’s communality (the squared factor loading) on a latent factor common
to multiple measures of the construct. Further refinements in domain sampling theory led
to ways of estimating the percentage of reliable variance associated with the general factor
of the test, ωh , or the entire test, ωt . But all of these estimates are based upon the idea

that there is one source of true variance to be estimated. An alternative approach recognizes
that scores have multiple sources of reliable variance and that for different questions we want
to generalize across different sources of variance. That is, do not say that a measure has a
reliability, but rather that it has different reliabilities, depending upon what aspects of the
test are being considered.
In addition to having true score variance, tests are seen as having additional sources of
variance, some of which are relevant and some of which are irrelevant when making decisions
using the test. The test variance thus needs to be decomposed into variance associated with
item, form, time, and source of information. To make the problem more complex, all of these
components can interact with each other and produce their own components of variance.
Thus, rather than use correlations as the index of reliability, generalizability theory introduced
by Cronbach et al. (1972) used an analysis of variance or ANOVA approach to decompose
the test variance. Only some of these sources of variance are relevant when making decisions
using the test.

Table 7.9 Reliability is the ability to generalize about individual differences across alternative sources
of variation. Generalizations within a domain of items use internal consistency estimates. If the items
are not necessarily internally consistent, reliability can be estimated based upon the worst split half, β,
the average split (corrected for test length) or the best split, λ4 . Reliability across forms or across time
is just the Pearson correlation. Reliability across raters depends upon the particular rating design and
is one of the family of Intraclass correlations.

Generalization over   Type of reliability          Name
Unspecified           Parallel tests               rxx
Items                 Internal consistency:
                        general factor (g)         ωh
                        > g, < h2                  α
                        all common (h2)            ωt
Split halves          random split half            2r12/(1 + r12)
                      worst split half             β
                      best split half              λ4
Form                  Alternative form             rxx
Time                  Test-retest                  rxx
Raters                Single rater                 ICC2
                      Average rater                ICC2k

7.4.1 Test-retest reliability: Stability across time

Perhaps the simplest example of the different components of variance associated with
reliability is to compare the reliability of a test of an emotional state with that of a test of a
personality or ability trait. For both tests, we would expect that items within each test given
at the same time should correlate with each other. That is, the tests should be internally
consistent. But if a test of mood state shows reliability over time (stability), then we question
whether it is in fact a test of mood. Similarly, a test of intellectual ability should be internally

consistent at any one time, but should also show stability across time. More formally, consider
the score for a particular person, i, on a particular test, j, at a particular time, o_k, with a
particular random error, e_{ijk}:

X_{ijk} = t_{ij} + o_k + e_{ijk}.
For two parallel tests at the same time, the time component drops out and the expected
score is just
X_{ijk} = t_{ij} + e_{ijk} = t_i + e_i

and reliability will be

r_{xx} = σ_t²/σ_X² = σ_t²/(σ_t² + σ_e²).  (7.36)
But, if the tests are given at different times, there might be an effect of time (practice,
learning, maturation) as well as an interaction of true score by time (people respond to the
different occasions differently.) Thus, the variance of an observation (not specifying time)
will be
σ_X² = σ_t² + σ_o² + σ_{to}² + σ_e².
The correlation of the test at time 1 with that at time 2 standardizes the observations at
both times and thus removes any mean change across time. However, the interaction of time
with true score remains and thus:
r_{xx} = σ_t²/σ_X² = σ_t²/(σ_t² + σ_{to}² + σ_e²).  (7.37)

In that the test-retest correlation reflects an additional variance component in the denominator
(the trait by time interaction), it can normally be expected to be less than the reliability
at one time. Two different examples of this effect emphasize the difference between measures
of emotional states versus cognitive traits. In a short term study of the effect of a movie
manipulation on mood and emotional state, tense arousal showed a momentary reliability
of .65 with a 30 minute temporal stability of .28 (Rafaeli and Revelle, 2006). Indeed, four
mood measures with an average internal consistency reliability of .84 (ranging from .65 to
.92) had an average temporal stability over 30 minutes of just .50 (ranging from .28 to .68).
Indeed, if a mood measure shows any stability across intervals as short as one or two days
it is probably no longer a measure of temporary mood! However, we would expect stability
in traits such as intellectual ability. In a longitudinal study with a 66 year interval, ability
test scores show an amazing amount of temporal stability of .66 with an estimated short term
test-retest reliability of .90 (Deary et al., 2004).
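A small simulation may make Equation 7.37 concrete. The sketch below is purely illustrative, with arbitrary unit variances for each component; the person by occasion deviations play the role of σ_{to}² and lower the test-retest correlation relative to the correlation of two parallel tests given at the same time.

set.seed(1)
n   <- 10000
t   <- rnorm(n)                    # stable true scores
to1 <- rnorm(n); to2 <- rnorm(n)   # person x occasion deviations at each time
x1  <- t + to1 + rnorm(n)          # observed score at time 1
x1p <- t + to1 + rnorm(n)          # a parallel test given at the same time
x2  <- t + to2 + rnorm(n)          # observed score at time 2
cor(x1, x1p)  # about .67: (sigma_t^2 + sigma_to^2)/(sigma_t^2 + sigma_to^2 + sigma_e^2)
cor(x1, x2)   # about .33: sigma_t^2/(sigma_t^2 + sigma_to^2 + sigma_e^2), Equation 7.37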

7.4.2 Intraclass correlations and the reliability of ratings across judges

The components of variance approach associated with generalizability theory is particularly
appropriate when considering the reliability of multiple raters or judges. By forming appro-
priate ratios of variances, various intraclass correlation coefficients may be found (Shrout
and Fleiss, 1979). The term intraclass is used because judges are seen as indistinguishable
members of a “class”. That is, there is no logical way of distinguishing them.

Consider the problem of a study using coders (judges) to assess some construct (e.g., the
amount of future orientation that each person reports in a set of essays, or the amount of
anxiety a judge rates in a subject based upon a 30 second video clip). Suppose that there are
100 subjects to be rated but a very limited number of judges. The rating task is very difficult
and it is much too much work for any one person to judge all the subjects. Rather, some
judges will rate some essays and other judges will rate other essays. What is the amount of
reliable subject variance estimated by the set of available judges? To answer this question,
one can take a subset of the subjects and have all of the judges rate just those targets. From
the analysis of the components of variance in those ratings (the generalizability study), it is
possible to estimate the amount of reliable variance in the overall set of ratings. The ICC
function calculates intraclass correlations by taking advantage of the power of R to use the
output from one function as input to another function. That is, ICC calls the aov function
to do an analysis of variance and then just organizes the mean square estimates from that
function to calculate the appropriate intraclass correlations and their confidence intervals
(Shrout and Fleiss, 1979).
For example, six subjects are given some score by six different judges (Table 7.10 and
Figure 7.7). Judges 1 and 2 give identical ratings, Judges 3 and 4 agree with Judges 1 and 2
in the relative ratings, but disagree in terms of level (Judge 3) or variance (Judge 4). Finally,
Judges 5 and 6 differ from the first four judges in their rank orders and differ from each other
in terms of their mean and variance. ICC reports the variance between subjects (MSb ), the
variance within subjects (MSw ), the variances due to the judges (MS j ), and the variance due
to the interaction of judge by subject (MS_e). The variance within subjects is based upon the
pooled variance due to judges and the judge by subject interaction. The reliability estimates from this generalizability analysis will depend upon how the
scores from the judges are to be used in the decision analysis.
The next three equations are adapted from Shrout and Fleiss (1979), who give a very
thorough discussion of the ICC as it is used in ratings, describe six different ICCs, and provide
formulas for their confidence intervals. Another useful discussion is by McGraw and Wong
(1996) and an errata published six months later.
ICC(1,1) : Each target is rated by a different judge and the judges are selected at random.
This is a one-way ANOVA model in which the judge effect is part of the error term
and is found by

ICC(1,1) = (MS_b − MS_w) / (MS_b + (n_j − 1) MS_w).
ICC(1,1) is sensitive to differences in means and variances between raters and is a measure of
absolute agreement. The interaction of target by judge is included in the error term. Compare
the results for Judges 1 and 2 versus 1 and 3. Although the variances are identical, because
the mean for Judge 3 is 5 points higher than that of Judge 1, ICC(1,1) for these two judges is actually
negative.
ICC(2,1) : A random sample of k judges rate the targets. The measure is one of absolute
agreement in the ratings. Mean differences in judges as well as the judge by target interaction
will affect the scores. Defined as
ICC(2,1) = (MS_b − MS_e) / (MS_b + (n_j − 1) MS_e + n_j (MS_j − MS_e)/n).

Because ICC(2,1) has a smaller residual error term (MSe ) it will usually, but not always be
greater than ICC(1,1) (but see the analysis for J1 and J5).

Table 7.10 The Intraclass Correlation Coefficient (ICC) measures the correlation between multiple
observers when the observations are all of the same class. It is found by doing an analysis of variance
to identify the effects due to subjects, judges, and their interaction. These are combined to form the
appropriate ICC. There are at least six different ICCs, depending upon the type of generalization that
is to be made. See Table 7.11 for results taken from these data.

> Ratings

J1 J2 J3 J4 J5 J6
1 1 1 6 2 3 6
2 2 2 7 4 1 2
3 3 3 8 6 5 10
4 4 4 9 8 2 4
5 5 5 10 10 6 12
6 6 6 11 12 4 8

> describe(Ratings,ranges=FALSE,skew=FALSE)
var n mean sd se
J1 1 6 3.5 1.87 0.76
J2 2 6 3.5 1.87 0.76
J3 3 6 8.5 1.87 0.76
J4 4 6 7.0 3.74 1.53
J5 5 6 3.5 1.87 0.76
J6 6 6 7.0 3.74 1.53

> print(ICC(Ratings),all=TRUE)
$results
type ICC F df1 df2 p lower bound upper bound
Single_raters_absolute ICC1 0.32 3.84 5 30 0.01 0.04 0.79
Single_random_raters ICC2 0.37 10.37 5 25 0.00 0.09 0.80
Single_fixed_raters ICC3 0.61 10.37 5 25 0.00 0.28 0.91
Average_raters_absolute ICC1k 0.74 3.84 5 30 0.01 0.21 0.96
Average_random_raters ICC2k 0.78 10.37 5 25 0.00 0.38 0.96
Average_fixed_raters ICC3k 0.90 10.37 5 25 0.00 0.70 0.98

Number of subjects = 6 Number of Judges = 6

$summary
Df Sum Sq Mean Sq F value Pr(>F)
subs 5 141.667 28.333 10.366 1.801e-05 ***
ind 5 153.000 30.600 11.195 9.644e-06 ***
Residuals 25 68.333 2.733
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ICC(3,1) : A fixed set of k judges rate each target. Mean differences between judges are
removed. There is no generalization to a larger population of judges.
ICC(3,1) = (MS_b − MS_e) / (MS_b + (n_j − 1) MS_e)

By removing the mean for each judge, ICC(3,1) is sensitive to variance differences between
judges (e.g., Judges 4 and 6 have four times the variance of Judges 1...3 and 5).
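To see what the ICC function is doing internally, the following minimal sketch (with arbitrary simulated ratings in long format, base R only) finds ICC(1,1) and ICC(3,1) directly from the ANOVA mean squares.

set.seed(42)
ratings <- data.frame(subject = factor(rep(1:6, each = 6)),
                      judge   = factor(rep(1:6, times = 6)),
                      score   = rnorm(36))            # arbitrary simulated ratings
one.way <- summary(aov(score ~ subject, data = ratings))[[1]]
two.way <- summary(aov(score ~ subject + judge, data = ratings))[[1]]
MSb <- one.way[1, "Mean Sq"]   # between subjects
MSw <- one.way[2, "Mean Sq"]   # within subjects (judge effects and interaction pooled)
MSe <- two.way[3, "Mean Sq"]   # residual after removing judge means
nj  <- 6                       # number of judges
(MSb - MSw) / (MSb + (nj - 1) * MSw)   # ICC(1,1)
(MSb - MSe) / (MSb + (nj - 1) * MSe)   # ICC(3,1)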

[Figure 7.7: Ratings by Judges. Ratings of the six subjects plotted for raters J1 ... J6.]

Fig. 7.7 When estimating the reliability of raters, it is important to consider what kind of reliability
is relevant. Although the correlations among raters J1 ... J4 are all 1, the raters
differ in their leniency as well as their variability. The intraclass correlation considers these different types
of reliability.

Table 7.11 Sources of variances and the Intraclass Correlation Coefficient.


(J1, J2) (J3, J4) (J5, J6) (J1, J3) (J1, J5) (J1 ... J3) (J1 ... J4) (J1 ... J6)
Variance estimates
MSb 7 15.75 15.75 7.0 5.2 10.50 21.88 28.33
MSw 0 2.58 7.58 12.5 1.5 8.33 7.12 7.38
MS j 0 6.75 36.75 75.0 0.0 50.00 38.38 30.60
MSe 0 1.75 1.75 0.0 1.8 0.00 .88 2.73
Intraclass correlations
ICC(1,1) 1.00 .72 .35 -.28 .55 .08 .34 .32
ICC(2,1) 1.00 .73 .48 .22 .53 .30 .42 .37
ICC(3,1) 1.00 .80 .80 1.00 .49 1.00 .86 .61
ICC(1,k) 1.00 .84 .52 -.79 .71 .21 .67 .74
ICC(2,k) 1.00 .85 .65 .36 .69 .56 .75 .78
ICC(3,k) 1.00 .89 .89 1.00 .65 1.00 .96 .90

For each of these three cases, reliability may also be estimated for the average rating of
multiple judges. In that case, each target receives the average of k ratings and the reliability
is increased according to the Spearman-Brown adjustment, yielding ICC(1,k), ICC(2,k), and ICC(3,k).
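As a check, the average-rater coefficients in Table 7.11 can be recovered from the single-rater coefficients with the Spearman-Brown formula; the short sketch below (a standalone helper, not a package function) uses the ICC(2,1) value of .37 reported for the six judges J1 ... J6.

spearman.brown <- function(r1, k) k * r1 / (1 + (k - 1) * r1)
spearman.brown(.37, 6)   # about .78, matching ICC(2,k) for the six judges in Table 7.11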

7.4.3 Generalizability theory: reliability over facets

The intraclass correlation analysis of the reliability of ratings in terms of components of
variance associated with raters, targets, and their interactions can be extended to other
domains. That is, the analysis of variance approach to the measurement of reliability focuses
on the relevant facets in an experimental design. If ratings are nested within teachers who
are nested within schools, and are given at different times, then all of these terms and their
interactions are sources of variance in the ratings. First do an analysis of variance in the
generalizability study to identify the variance components. Then determine which variance
components are relevant for the application in the decision study in which one is trying to
use the measure (Cronbach et al., 1972). Similarly, the components of variance associated
with parts of a test can be analyzed in terms of the generalizability of the entire test.
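A generalizability study of this kind can be sketched with the base R aov function. The example below is a minimal sketch with simulated, purely random ratings in long format; in a real decision study the mean squares would then be converted to variance components for the facets of interest.

set.seed(17)
gdata <- expand.grid(person   = factor(1:20),
                     rater    = factor(1:4),
                     occasion = factor(1:2))
gdata$score <- rnorm(nrow(gdata))    # random scores, for illustration only
summary(aov(score ~ person * rater + occasion, data = gdata))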

7.4.4 Reliability of a composite test

If a test is made up of subtests with known reliability, it is possible to find the reliability
of the composite in terms of the known reliabilities and the observed correlations. Early
discussions of this by Cronbach et al. (1965) considered the composite reliability of a test
to be a function of the reliabilities of each subtest, ρxxi (for which Cronbach used αi ), the
subtest variance, σi2 , and the total test variance, σX2 ,

α_s = 1 − Σ(1 − ρ_{xx_i}) σ_i² / σ_X².  (7.38)

The example in Table 7.5 had three groups with reliabilities of .857, .857 and .823, total
variance of 53.2, and variance/covariances of
G1 G2 G3
G1 21.0 4.8 2.4
G2 4.8 11.2 1.6
G3 2.4 1.6 3.4
The composite α is therefore

α_s = 1 − [(1 − .857)(21.0) + (1 − .857)(11.2) + (1 − .823)(3.4)] / 53.2 = .90
which is the same value as found for ωt and is the correct value given the known structure of
this problem. However, the items in the example all have equal correlations within groups. For
the same reason that α underestimates reliability, αs will also underestimate the reliability if
the items within groups do not have identical correlations. αs is preferred to α for estimating
the reliability of a composite, but is still not as accurate as ωt . Although Cronbach et al.

(1965) used αi as an estimate for the subtest reliability, ρxxi , it is possible to use a better
estimate of reliability for the subtests. ωt can, of course, be found using the omega function or
can be found by using a “phantom variable” approach (Raykov, 1997) in a structural equation
solution using sem (Chapter 10).
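Equation 7.38 is easy to verify numerically. The following minimal sketch uses the subtest reliabilities and the variance/covariance matrix given above; only base R is assumed.

rel <- c(.857, .857, .823)               # subtest reliabilities
V   <- matrix(c(21.0,  4.8, 2.4,
                 4.8, 11.2, 1.6,
                 2.4,  1.6, 3.4), 3, 3)  # subtest variances and covariances
alpha.s <- 1 - sum((1 - rel) * diag(V)) / sum(V)
alpha.s                                  # about .90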

7.4.5 Reliability of a difference score

It is sometimes useful to create a score made up of the difference between two tests. Just as
the variance of a composite of X1 and X2 is the sum of the variances and twice the covariance of
X1 and X2 (Equation 7.39), so is the variance of a difference, except in this case the covariance
is negative. The reliability of this difference score, r∆ ∆ , may be found by the ratio of the
reliable variance to the total variance and is a function of the reliable variances for the two
components as well as their intercorrelation:

r_{∆∆} = (σ_1² r_{xx_1} + σ_2² r_{xx_2} − 2 σ_1 σ_2 r_{12}) / (σ_1² + σ_2² − 2 σ_1 σ_2 r_{12}).  (7.39)

This is equivalent to finding the reliability of a sum of two tests (Equation 7.38), but now
the tests are negatively rather than positively correlated. As the correlation between the two
tests increases, the reliability of their differences decreases.
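Equation 7.39 can be wrapped in a small helper function (a sketch, not part of any package) to see how the reliability of a difference score falls as the correlation between the two tests rises.

rel.diff <- function(rxx1, rxx2, s1 = 1, s2 = 1, r12 = 0) {
  (s1^2 * rxx1 + s2^2 * rxx2 - 2 * s1 * s2 * r12) /
    (s1^2 + s2^2 - 2 * s1 * s2 * r12)
}
rel.diff(.8, .8, r12 = 0)    # uncorrelated tests: .80
rel.diff(.8, .8, r12 = .5)   # correlated tests: .60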

Fig. 7.8 The reliability of composites or differences of two tests depends upon the reliability of the
tests as well as their intercorrelation. The reliability of the composite increases as the tests are more
correlated (left hand panel). However, the reliability of a difference decreases as the tests are more
correlated (right hand panel).

[Figure 7.8: two panels, Reliability of a composite (left) and Reliability of a difference (right), plotting reliability against the intercorrelation of the two tests.]

7.5 Using reliability to estimate true scores

Although developed to correct for attenuation of correlations due to poor measurement,
another important use of reliability is to estimate a person's true or domain score given their
observed score. Because the regression of true score on observed score is b_{t.x} = σ_t²/σ_x², which is
just the reliability (Equation 7.36), it is clear that

t̂ = b_{t.x} x = (σ_t²/σ_x²) x = ρ_{xt}² x = r_{xx} x.  (7.40)

That is, expected true scores will regress towards the mean observed score as a function
of 1 - rxx . This regression to the mean is the source of great confusion for many people
because it implies that the best estimate for a person’s retest score is closer to the mean of
the population than is the observed score. Real life examples of this are seen in sports and
finance where a baseball player who does well one year can be expected to do less well the
next year just as a financial advisor who does well one year will probably not do as well
the next year (Bernstein, 1996; Stigler, 1999). Perhaps the classic example is that of flight
instructors who observe that praising good performance results in decreases in performance
while punishing poor performance results in improved performance (Tversky and Kahneman,
1974).
Knowing the reliability also allows for confidence intervals around the estimated true score.
Because the proportion of error variance in an estimate is just 1 − r_{xx}, the standard error
of measurement is

σ_e = σ_x √(1 − ρ_{xt}²) = σ_x √(1 − r_{xx}).  (7.41)
Because the estimated true score is closer to the mean than the observed score, the confidence
intervals of true score will be asymmetric around the observed score. Consider three observed
scores of +2, 0, and -2 and their estimated true scores and confidence intervals as a function
of the test reliability.
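The following minimal sketch (standardized scores, so σ_x = 1) produces the estimated true scores and their 95% confidence intervals for these three observed scores; with r_xx = .90 the interval for an observed score of 2 matches the 1.18 to 2.42 range shown in Figure 7.9.

rxx   <- .90
x     <- c(-2, 0, 2)                 # observed (standardized) scores
t.hat <- rxx * x                     # estimated true scores regress to the mean
se    <- sqrt(1 - rxx)               # standard error of measurement when sd = 1
cbind(x, t.hat, lower = t.hat - 1.96 * se, upper = t.hat + 1.96 * se)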
All of the approaches discussed in this chapter have considered the reliability of a measure
in terms of the variance of observed scores and the variance of latent scores. Although the
effect of items is considered in terms of how the items intercorrelate, items are assumed to
be sampled at random from some universe of items. Reliability is a characteristic of the test
and of a sample of people. A test’s reliability will be increased if the true score variance is
increased, but this can be done in a somewhat artificial manner. Consider a class of first year
graduate students in psychometrics. An exam testing their knowledge will probably not be
very reliable in that the variance of the students’ knowledge is not very great. But if several
first year undergraduates who have not had statistics and several psychometricians are added
to the group of test takers, suddenly the reliability of the test is very high because the between
person variance has increased. But the precision for evaluating individual differences within
the set of graduate students has not changed. What is needed is a model of how an individual
student responds to an individual item rather than how a group of students responds to a
group of items (a test). In the next chapter we consider how these ideas can be expanded
to include a consideration of how individual items behave, and how it is possible to get an
estimate of the error associated with the measure for a single individual.

Fig. 7.9 Confidence intervals vary with reliability. For observed scores of -2, 0, and 2, the estimated
true scores and confidence intervals vary by reliability. The confidence range is symmetric about the
estimated true score. Even with reliability = .90, the confidence interval for a true score with an
observed score of 2 ranges from 1.18 to 2.42 (1.8 − 1.96√(1 − .9) to 1.8 + 1.96√(1 − .9)).

[Figure 7.9: Confidence intervals of true score estimates, plotted as a function of reliability (0 to 1) for true scores from -3 to 3.]
Chapter 8
The “New Psychometrics” – Item Response Theory

Classical test theory is concerned with the reliability of a test and assumes that the items
within the test are sampled at random from a domain of relevant items. Reliability is seen
as a characteristic of the test and of the variance of the trait it measures. Items are treated
as random replicates of each other and their characteristics, if examined at all, are expressed
as correlations with total test score or as factor loadings on the putative latent variable(s) of
interest. Characteristics of their properties are not analyzed in detail. This led Mellenbergh
(1996) to the distinction between theories of tests (Lord and Novick, 1968) and a theories
of items (Lord, 1952; Rasch, 1960). The so-called “New Psychometrics” (Embretson and
Hershberger, 1999; Embretson and Reise, 2000; Van der Linden and Hambleton, 1997) is a
theory of how people respond to items and is known as Item Response Theory or IRT . Over
the past twenty years there has been explosive growth in programs that can do IRT, and
within R there are at least four very powerful packages: eRm (Mair and Hatzinger, 2007),
ltm (Rizopoulos, 2006), lme4 (Doran et al., 2007) and MiscPsycho (Doran, 2010). Additional
packages include mokken (van der Ark, 2010) to do non-metric IRT and plink (Weeks, 2010)
to link multiple groups together. More IRT packages are being added all of the time.
In the discussion of Coombs’ Theory of Data the measurement of attitudes and abilities
(2.9) were seen as examples of comparing an object (an item) to a person. The comparison
was said to be one of order (for abilities) or of distance (for attitudes). The basic model was
that for ability there is a latent value for each person, θi and a latent difficulty or location1
δ j for each item. The probability of a particular person getting a specific ability item correct
was given in Equation 2.14 and is

prob(correct|θ , δ ) = f (θ − δ ) (8.1)

while for an attitude, the probability of item endorsement is

prob(endorsement|θ , δ ) = f (|θ − δ |) (8.2)

and the question becomes what functions best represent the data.

1 The original derivation of IRT was in terms of measuring ability and thus the term item difficulty was
used. In more recent work in measuring quality of life or personality traits some prefer the term item
location. Although I will sometimes use both terms, most of the following discussion uses difficulty as
the term used to describe the location parameter.


At the most general level, the probability of being correct on an item will be a monotoni-
cally increasing function of ability while the probability of endorsing an attitude item will be
a single peaked function of the level of that attitude. Although the distinction between ability
and attitude scaling is one of ordering versus one of distance seems clear, it is unclear which
is the appropriate model for personality items. (Ability items are thought to reflect maximal
competencies while personality items reflect average or routine thoughts, feelings and behav-
iors.) Typical analyses of personality items assume the ordering model (Equation 8.1) but
as will be discussed later (8.5.2) there are some who recommend the distance model (Equa-
tion 8.2). The graphic representation of the probability of being correct or endorsing an item
shows the trace line of the probability plotted as a function of the latent attribute (Lazars-
feld, 1955; Lord, 1952). Compare the hypothetical trace lines for ability items (monotonically
increasing as in Figure 2.9) with that of attitude items (single peaked as in Figure 2.10).
A trace line is also called an item characteristic curve or icc which should not be confused
with an Intra-Class Correlation (ICC ). Two requirements for the function should be that the
trait (ability or attitude) can be unbounded (there is always someone higher than previously
measured, there is always a more difficult item) and that the response probability is bounded
(0,1). That is −∞ < θ < ∞, −∞ < δ < ∞ and 0 < p < 1. The question remains, what are the
best functions to use?

8.1 Dichotomous items and monotonic trace lines: the measurement of ability

An early solution to this question for the ability domain was proposed by Guttman (1950)
and was a deterministic step function with no model of error (Equation 2.16). However, a
person’s response to an item is not a perfect measure of their underlying disposition and
fluctuates slightly from moment to moment. That is, items do have error. Thus two common
probabilistic models are the cumulative normal (2.17) and the logistic model (2.18). Although
these models seem quite different, with the addition of a multiplicative constant (1.702) these
two models appear to be almost identical over the range from -3 to 3 (Figure 2.8) and because
the logistic function is easier to manipulate, many derivations have been done in terms of the
logistic model. However, as Samejima (2000) has noted, even with identical item parameters,
these two models produce somewhat different orderings of subjects. Even so, it is probably
clearer to first discuss the logistic function and then consider some alternative models.

8.1.1 Rasch Modeling - one parameter IRT

If all items are assumed to be equally good measures of the trait, but to differ only in their
difficulty/location, then the one parameter logistic (1PL) Rasch model (Rasch, 1960) is the
easiest to understand:
p(correct_{ij}|θ_i, δ_j) = 1 / (1 + e^{δ_j − θ_i}).  (8.3)
That is, the probability of the ith person being correct on (or endorsing) the jth item is a
logistic function of the difference between the person’s ability (latent trait) (θi ) and the item

difficulty (or location) (δ j ). The more the person’s ability is greater than the item difficulty,
the more likely the person is to get the item correct. To estimate a person’s ability we need
only know the probability of being correct on a set of items and the difficulty of those items.
Similarly, to estimate item difficulty, we need only know the probability of being correct on
an item and the ability of the people taking the item. Wright and Mok (2004) liken this to
the problem of a comparing high jumpers to each other in terms of their ability to jump over
fences of different heights. If one jumper is more likely to jump over a fence of a particular
height than is another, or equally likely to clear a higher fence than the other, it is the ratio
of likelihoods that is most useful in determining relative ability between people as well as
comparing a person to an item.
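Equation 8.3 is simple enough to write directly. The sketch below defines the Rasch item response function as a standalone illustration (not taken from any package) and evaluates it for a person one logit above an item of average difficulty.

p.rasch <- function(theta, delta) 1 / (1 + exp(delta - theta))
p.rasch(theta = 1, delta = 0)   # about .73
curve(p.rasch(x, delta = 0), -4, 4, xlab = "Ability (logit units)", ylab = "P(correct)")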
The probability of missing an item, q, is just 1 - p(correct) and thus the odds ratio of
being correct for a person with ability, θi , on an item with difficulty, δ j is
OR_{ij} = p/(1 − p) = p/q = [1/(1 + e^{δ_j − θ_i})] / [e^{δ_j − θ_i}/(1 + e^{δ_j − θ_i})] = e^{θ_i − δ_j}.  (8.4)

That is, the odds ratio will be an exponential function of the difference between a person's
ability and the task difficulty. The odds of a particular pattern of rights and wrongs over n
items will be the product of n odds ratios

OR_{i1} OR_{i2} ... OR_{in} = ∏_{j=1}^{n} e^{θ_i − δ_j} = e^{nθ_i} e^{−∑_{j=1}^{n} δ_j}.  (8.5)

Substituting P for the pattern of correct responses and Q for the pattern of incorrect re-
sponses, and taking the logarithm of both sides of equation 8.5 leads to a much simpler
form:
ln(P/Q) = nθ_i − ∑_{j=1}^{n} δ_j = n(θ_i − δ̄).  (8.6)

That is, the log of the pattern of correct/incorrect for the ith individual is a function of the
number of items * (θi - the average difficulty). Specifying the average difficulty of an item as
δ̄ = 0 to set the scale, then θi is just the logarithm of P/Q divided by n or, conceptually, the
average logarithm of the p/q
θ_i = ln(P/Q) / n.  (8.7)
Similarly, the pattern of the odds of correct and incorrect responses across people for a
particular item with difficulty δ j will be
OR_{1j} OR_{2j} ... OR_{Nj} = P/Q = ∏_{i=1}^{N} e^{θ_i − δ_j} = e^{∑_{i=1}^{N} θ_i − Nδ_j}  (8.8)

and taking logs of both sides leads to


ln(P/Q) = ∑_{i=1}^{N} θ_i − Nδ_j.  (8.9)

Letting the average ability θ̄ = 0 leads to the conclusion that the difficulty of an item for all
subjects, δ j , is the logarithm of Q/P divided by the number of subjects, N,

δ_j = ln(Q/P) / N.  (8.10)
That is, the estimate of ability (Equation 8.7) for items with an average difficulty of 0 does not
require knowing the difficulty of any particular item, but is just a function of the pattern of
corrects and incorrects for a subject across all items. Similarly, the estimate of item difficulty
across people ranging in ability, but with an average ability of 0 (Equation 8.10) is a function
of the response pattern of all the subjects on that one item and does not depend upon
knowing any one person’s ability. The assumptions that average difficulty and average ability
are 0 are merely to fix the scales. Replacing the average values with a non-zero value just
adds a constant to the estimates.
The independence of ability from difficulty implied in equations 8.7 and 8.10 makes es-
timation of both values very straightforward. These two equations also have the important
implication that the number correct (n p̄ for a subject, N p̄ for an item) is monotonically, but
not linearly related to ability or to difficulty. That the estimated ability is independent of
the pattern of rights and wrongs but just depends upon the total number correct is seen as
both a strength and a weakness of the Rasch model. From the perspective of fundamental
measurement, Rasch scoring provides an additive interval scale: for all people and items, if
θi < θ j and δk < δl then p(x|θi , δk ) < p(x|θ j , δl ). But this very additivity treats all patterns
of scores with the same number correct as equal and ignores potential information in the
pattern of responses (see the discussion of the normal ogive model in section 8.1.2).
Consider the case of 1000 subjects taking the Law School Admissions Exam (the LSAT).
This example is taken from a data set supplied by Lord to Bock and Lieberman (1970),
which has been used by McDonald (1999) and is included in both the ltm and the psych
package. The original data table from Bock and Lieberman (1970) reports the response
patterns and frequencies of five items (numbers 11 - 15) of sections 6 and 7. Section 6 were
highly homogeneous Figure Classification items and Section 7 were items on Debate. This
table has been converted to two data sets: lsat6 and lsat7 using the table2df function.
Using the describe function for descriptive statistics and then the rasch function from ltm
gives the statistics shown in Table 8.1. The presumed item characteristic functions and the
estimates of ability using the model are seen in Figure 8.1. Simulated Rasch data can be
generated using the sim.rasch function in psych or the mvlogis function in ltm.
What is clear from Table 8.1 is that these items are all very easy (percent correct endorse-
ments range from .55 to .92 and the Rasch scaled values range from -1.93 to -.20). Figure 8.1
shows the five parallel trace lines generated from these parameters as well as the distribution
of subjects on an underlying normal metric. What is clear when comparing the distribution
of ability estimates with the item information functions is that the test is most accurate in
ordering those participants with the lowest scores.
Because the person and item parameters are continuous, but the response patterns are
discrete, neither person nor item will fit the Rasch model perfectly. The residual for the
pairing of a particular person with ability θi for a particular item with difficulty δ j will be
the difference between the observed response, xi j , and the modeled response, pi j

x_{ij} − p_{ij} = x_{ij} − 1/(1 + e^{δ_j − θ_i}).

[Figure 8.1: three panels for the lsat6 items Q1 ... Q5: Item Characteristic Curves, Item Information Curves, and a Kernel Density Estimate of the ability estimates, each plotted against Ability.]

Fig. 8.1 The rasch function estimates item characteristic functions, item information functions, as
well as the distribution of ability for the various response patterns in the lsat6 data set. All five items
are very easy for this group. Graphs generated by plot.rasch and plot.fscores in the ltm package.
Because the Rasch model assumes all items are equally discriminating, the trace lines and the resulting
item information functions are all identically shaped. The item information is just the first derivative
of the item characteristic curve.

Table 8.1 Given the 1000 subjects in the LSAT data set (taken from Bock and Lieberman (1970)),
the item difficulties may be found using the Rasch model in the ltm package. Compare these difficulties
with the item means as well as the item thresholds, τ. The items have been sorted into ascending order
of item difficulty using the order and colMeans functions.

data(bock)
> ord <- order(colMeans(lsat6),decreasing=TRUE)
> lsat6.sorted <- lsat6[,ord]
> describe(lsat6.sorted)
> Tau <- round(-qnorm(colMeans(lsat6.sorted)),2) #tau = estimates of threshold
> rasch(lsat6.sorted,constraint=cbind(ncol(lsat6.sorted)+1,1.702))

var n mean sd median trimmed mad min max range skew kurtosis se
Q1 1 1000 0.92 0.27 1 1.00 0 0 1 1 -3.20 8.22 0.01
Q5 2 1000 0.87 0.34 1 0.96 0 0 1 1 -2.20 2.83 0.01
Q4 3 1000 0.76 0.43 1 0.83 0 0 1 1 -1.24 -0.48 0.01
Q2 4 1000 0.71 0.45 1 0.76 0 0 1 1 -0.92 -1.16 0.01
Q3 5 1000 0.55 0.50 1 0.57 0 0 1 1 -0.21 -1.96 0.02

> Tau
Q1 Q5 Q4 Q2 Q3
-1.43 -1.13 -0.72 -0.55 -0.13

Call:
rasch(data = lsat6.sorted, constraint = cbind(ncol(lsat6.sorted) +
1, 1.702))

Coefficients:
Dffclt.Q1 Dffclt.Q5 Dffclt.Q4 Dffclt.Q2 Dffclt.Q3 Dscrmn
-1.927 -1.507 -0.960 -0.742 -0.195 1.702

Because p_{ij} is a binomial probability it will have variance p_{ij}(1 − p_{ij}). Thus, the residual
expressed as a z score will be

z_{ij} = (x_{ij} − p_{ij}) / √(p_{ij}(1 − p_{ij}))
and the square of this will be a χ 2 with one degree of freedom. Summing this over all n items
for a subject and dividing by n yields a goodness of fit statistic, outfit, which represents the
“outlier sensitive mean square residual goodness of fit statistic” for the subject, the person
outfit, (Wright and Mok, 2004, p 13)
u_i = ∑_{j=1}^{n} z_{ij}² / n  (8.11)

with the equivalent item outfit based upon the sum of misfits across people for a particular
item
u_j = ∑_{i=1}^{N} z_{ij}² / N.  (8.12)

Because the outfit will be most responsive to unexpected deviations (missing an easy item
for a very able person, or passing a difficult item for a less able person), the infit statistic is
the “information weighed mean square residual goodness of fit statistic” (Wright and Mok,

2004, p 13). The weight, W_{ij}, is the variance W_{ij} = p_{ij}(1 − p_{ij}) and the person infit is

v_i = ∑_{j=1}^{n} z_{ij}² W_{ij} / ∑_{j=1}^{n} W_{ij}  (8.13)

(Wright and Mok, 2004). The infit and outfit statistics are available in the eRm package by
Mair and Hatzinger (2009).
In addition to the lsat6 and lsat7 data sets, two well documented example data sets are
Bond's Logical Operations Test (BLOT) (Bond, 1995) and the Piage-
tian Reasoning Task (PRTIII by Shayer et al. (1976)), the data for which may be found
online at http://homes.jcu.edu.au/~edtgb/book/data/Bond87.txt or in the introduc-
tory text by Bond and Fox (2007). By copying the data into the clipboard and using the
read.clipboard.fwf function to read a fixed width formatted file, we are able to compare
the results from the ltm and eRm packages with some of the commercial packages such
as WINSTEPS (Linacre, 2005). It is important to note that the estimates of these three
progams are not identical, but are rather linear transforms of each other. Thus, the qnorm
of the colMeans, and the rasch estimates from ltm will match the RM estimates from eRm
if the slopes are constrained to be one, and an additive constant is added. The RM estimates
differ from the WINSTEPS merely in their sign, in that RM reports item easiness estimates
while WINSTEPS reports item difficulty. Finally, RM by default forces the mean difficulty
to be 0.0, while rasch does not. This agreement is also true for the person estimates, which
are just inverse logistic, or logit, transforms of the total score. This is seen by some as the
power of the Rasch model in that it is monotonic with total score. In fact, the θ estimate is
just a logit transformation of total score expressed as a proportion correct, P̄, and the proportion
incorrect, Q̄ = 1 − P̄:

θ_i = −ln(1/P̄_i − 1) = ln(P̄_i / Q̄_i).  (8.14)
That is, for complete data sets, Rasch scores are easy to find without the complexity of various
packages. But the power of the Rasch (and other IRT models) is not for complete data, but
when data are missing or when items are tailored to the subject (see ??). Additionally, by
expressing person scores on the same metric as item difficulties, it is easy to discover when
the set of items are too difficult (e.g., the PRT) or too easy (e.g., the BLOT) for the subjects
(Figure 8.2).
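For complete data this logit shortcut can be checked directly. The sketch below assumes the psych package (which supplies lsat6 via data(bock)); persons with perfect or zero total scores get infinite estimates and are handled differently by the full IRT programs.

library(psych)
data(bock)                         # loads lsat6 (and lsat7)
P.bar <- rowMeans(lsat6)           # proportion correct for each person
theta <- log(P.bar / (1 - P.bar))  # Equation 8.14; infinite for perfect or zero scores
table(round(theta, 2))             # one theta value per total score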
Good introductions to the Rasch model include a chapter by Wright and Mok (2004), and
texts by Andrich (1988) and Bond and Fox (2007). Examples of applied use of the Rasch
model for non-ability items include developing short forms for assessing depression (Cole
et al., 2004) and the impairment in one’s daily life associated with problems in vision (Denny
et al., 2007). For depression, an item that is easy to endorse is “I felt lonely” while “People
were unfriendly” is somewhat harder to endorse, and “I felt my life had been a failure” is
much harder to endorse (Cole et al., 2004). For problems with one’s vision, many people will
have trouble reading normal size newsprint but fewer will have difficulty identifying money
from a wallet, and fewer yet will have problems cutting up food on a plate (Denny et al.,
2007).

[Figure 8.2: Person-Item Maps for the BLOT (left) and the PRT (right), plotting the person parameter distribution and the item locations on the latent dimension.]

Fig. 8.2 The plotPImap function in the eRm package plots the distribution of the person scores
on the same metric as the item difficulties. While the hardest BLOT item is easier than the ability
of 22% of subjects, the PRT items are much harder and all except two items are above 30% of the
subjects. In that the subjects are the same for both tests, the fact that the distributions of latent scores are not
equivalent suggests problems of precision.

8.1.2 The normal ogive – another one parameter model

The Rasch model is based upon a logistic function rather than the cumulative normal. As
shown earlier, with the multiplication of a constant, these functions are practically identical,
and the logistic is relatively easy to calculate. However, with modern computational power,
ease of computation does not compensate for difficulties of interpretation. Most people would
prefer to give more credit to getting a difficult item correct than getting an easy item correct.
But the logistic solution will give equal credit to any response pattern with the same number
of correct items. (To the proponents of the Rasch model, this is one of its strengths, in that
Rasch solutions are monotonic with total score.) The cumulative normal model (also known
as the normal ogive or the one parameter normal, 1PN , model) gives more credit for passing
higher items:
p(correct|θ, δ) = (1/√(2π)) ∫_{−∞}^{θ−δ} e^{−u²/2} du  (8.15)
where u = θ − δ . By weighting the squared difference between a person and an item, greater
weight is applied to missing extremely easy and passing extremely hard items than missing
or passing more intermediate items. This will lead to scores that depend upon the particular
pattern of responses rather than just total score. The difference between these two models
may be seen in Figure 8.3 which compares logistic and normal ogive models for the 32 possible
response patterns of five items ranging in difficulty from -3 to 3 (adapted from Samejima
(2000)). Examine in particular the response patterns 2-7 where the score estimate increases
even though the number correct remains 1.

Another strength of the cumulative normal or normal ogive model is that it corresponds
directly to estimates derived from factor analysis of the tetrachoric correlation matrix (Mc-
Donald, 1999). This is more relevant when two parameter models are discussed, for then the
factor loadings can be translated into the item discrimination parameter (8.3).
The cumulative normal model, however, has an interesting asymmetry, in that for a pattern
of all but one wrong item, the difficulty of the one passed item determines the ability estimate,
while for a set of items that are all passed except for one failure, the difficulty of the failed
item determines the ability estimate. This leads to the non-intuitive observation that getting
the hardest four of five items correct shows less ability than getting all but the hardest item
correct (Samejima, 2000). Consider five items with difficulties of -3, -1.5, 0, 1.5 and 3. These
five items will lead to the 32 response patterns although scores can only be found for the
cases of at least one right or one wrong. IRT estimates of score using the cumulative normal
model range from -2.28 to 2.28 and correlate .93 with total correct. IRT estimates using the
logistic function also range from -2.28 to 2.28 and correlate 1.0 with total correct. However,
the logistic does not discriminate between differences in pattern within the same total score
while the cumulative normal does (data in Figure 8.3 adapted from Samejima (2000)). To
Rasch enthusiasts, this is a strength of the Rasch model, but to others, a weakness. An
alternative, discussed by Samejima (1995, 2000), is to add an acceleration parameter to the
logistic function which weights harder items more than easy items. This generalization is
thus one of the many two and three parameter models that have been proposed.

8.1.3 Parameter estimation

The parameters of classical test theory, total score and item-whole correlations, are all easy
to find. The parameters of IRT, however, require iterative estimation, most typically using
the principle of maximum likelihood and the assumption of local independence. That is,
for a pattern of responses, v, made up of individual item responses u j with values of 0 or 1,
the probability, P, of passing an item and the probability, Q = 1 − P, of failing it define the
probability distribution function of U_j, the response to the jth item, which depends upon the
subject's ability, θ_i, and the item difficulty, δ_j:

f_j(u_j|θ) = P_j(θ_i)^{u_j} Q_j(θ_i)^{1−u_j}  (8.16)

then, with the assumption of local independence, the probability of data pattern u of successes
and failures is the product of the individual probabilities
L(f) = ∏_{j=1}^{n} P_j(θ_i)^{u_j} Q_j(θ_i)^{1−u_j}  (8.17)

(Lord and Novick, 1968) and, by taking logarithms


ln L(f) = ∑_{j=1}^{n} [x_j ln P_j + (1 − x_j) ln Q_j]  (8.18)

(McDonald, 1999, p 281).



[Figure 8.3: Normal and Logistic estimates of score by response pattern, plotted for the 32 possible response patterns of five items (00000 through 11111).]
Fig. 8.3 For 32 response patterns of correct (1) or incorrect (0), the estimated scores from the logistic
function do not discriminate between different response patterns with the same total score. The cu-
mulative normal model does, but has some problematic asymmetries (Samejima, 2000). The solid line
represents the cumulative normal scores, the dashed line the logistic estimates. Neither model can find
estimates when the total is either 0 or 5 (all wrong or all correct).

The best estimate for θi is then that value that has the maximum likelihood which may
be found by iterative procedures using the optim function or IRT packages such as ltm or
eRm.
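As a minimal sketch of this estimation problem (not how ltm or eRm are actually implemented), the log likelihood of Equation 8.18 for a single response pattern with known 1PL item difficulties can be maximized with optim; the difficulties below are hypothetical values similar to those estimated for lsat6.

loglik <- function(theta, x, delta) {
  p <- 1 / (1 + exp(delta - theta))          # 1PL trace lines
  sum(x * log(p) + (1 - x) * log(1 - p))     # Equation 8.18
}
delta <- c(-1.9, -1.5, -1.0, -0.7, -0.2)     # hypothetical item difficulties
x     <- c(1, 1, 1, 0, 0)                    # one observed response pattern
fit   <- optim(par = 0, fn = loglik, x = x, delta = delta,
               method = "BFGS", control = list(fnscale = -1))
fit$par                                      # the maximum likelihood estimate of theta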

8.1.4 Item information

When forming a test and evaluating the items within a test, the most useful items are the ones
that give the most information about a person’s score. In classic test theory, item information
is the reciprocal of the squared standard error for the item or for a one factor test, the ratio
of the item communality to its uniqueness:

I_j = 1/σ_{e_j}² = h_j²/(1 − h_j²).

When estimating ability using IRT, the information for an item is a function of the first
derivative of the likelihood function and is maximized at the inflection point of the icc. The
information function for an item is
I(f, x_j) = [P_j′(f)]² / (P_j(f) Q_j(f))  (8.19)

(McDonald, 1999, p 285). For the 1PL model, the first derivative of the probability
function P_j(f) = 1/(1 + e^{δ−θ}) is

P′ = e^{δ−θ} / (1 + e^{δ−θ})²  (8.20)

which is just P_j Q_j and thus the information for an item is

I_j = P_j Q_j.  (8.21)

That is, information is maximized when the probability of getting an item correct is the
same as getting it wrong, or, in other words, the best estimate for an item’s difficulty is that
value where half of the subjects pass the item. Since the test information is just the sum of
the item information across items, a test can be designed to provide maximum information
(and the smallest standard error of measurement) at a particular point by having items of
a particular difficulty, or it can be designed to have relatively uniform information across a
range of ability by having items of different difficulties.
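A short sketch of Equation 8.21: plotting P·Q for a single 1PL item shows that the information peaks where the probability of passing is .5, that is, where ability equals the item difficulty.

p1 <- function(theta, delta) 1 / (1 + exp(delta - theta))
info1 <- function(theta, delta) p1(theta, delta) * (1 - p1(theta, delta))
curve(info1(x, delta = 0), -4, 4, xlab = "Ability", ylab = "Item information")
abline(v = 0, lty = 2)   # information is greatest where theta = delta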

8.1.5 Two parameter models

Both the one parameter logistic (1PL) and the one parameter normal ogive (1PN ) models as-
sume that items differ only in their difficulty. Given what we know about the factor structure
of items, this is unrealistic. Items differ not only in how hard they are to answer, they also
differ in how well they assess the latent trait. This leads to the addition of a discrimination
parameter, α, which has the effect of magnifying the importance of the difference between
the subject’s ability and the item difficulty. In the two parameter logistic (2PL) model this
leads to the probability of being correct as
p(correct_{ij}|θ_i, α_j, δ_j) = 1 / (1 + e^{α_j(δ_j − θ_i)})  (8.22)
while in the two parameter normal ogive (2PN ) model this is
p(correct|θ, α, δ) = (1/√(2π)) ∫_{−∞}^{α(θ−δ)} e^{−u²/2} du  (8.23)

where u = α(θ − δ ).

The information function for a two parameter model reflects the item discrimination pa-
rameter, α,
I_j = α² P_j Q_j  (8.24)

which, for a 2PL model, is

I_j = α_j² P_j Q_j = α_j² e^{α_j(δ_j − θ_i)} / (1 + e^{α_j(δ_j − θ_i)})².  (8.25)
The addition of the discrimination parameter leads to a better fit to the data but
also leads to non-parallel trace lines. Thus, an item can be both harder at low levels of ability
and easier at high levels of ability (Figure 8.4). Indeed, the trace lines for two items, i and j,
will cross over when
θ_{ij} = (α_i δ_i − α_j δ_j) / (α_i − α_j)  (8.26)
(Sijtsma and Hemker, 2000). This leads to a logical problem. By improving the intuitive
nature of the theory (that is to say, allowing items to differ in their discriminability) we have
broken the simple additivity of the model. In other words, with the 1PL model, if one person
is more likely to get an item with a specific level of difficulty correct than is another person,
then this holds true for items with any level of difficulty. That is, items and people have
an additive structure. But, with the addition of the discrimination parameter, rank orders
of the probability of endorsing an item as a function of ability will vary as a function of
the item difficulty. It is this failure of additivity that leads some (e.g., Cliff (1992); Wright
(1977); Wright and Mok (2004)) to reject any generalizations beyond the Rasch model.
For, by violating additivity, the basic principles of fundamental measurement theory are
violated. In a comparison of 20 different IRT models (10 of dichotomous responses, 10 for
polytomous responses) Sijtsma and Hemker (2000) show that only the 1PL and 1PN models
for dichotomous and the RSM and PCM models for polytomous items have this important
property of not having item functions intersect each other. The RSM and PCM models are
discussed below.
The ltm function in ltm estimates the two parameter model and produces the associated
icc and item information curve, iic. Compare Figure 8.5 with the same set of graphs for the
Rasch model (Figure 8.1). Although the items have roughly equal slopes and do not intersect, they
do differ in the information they provide.
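A minimal sketch of such an analysis, assuming both the ltm and psych packages are installed (psych supplies the lsat6 data used throughout this chapter):

library(ltm)
data(bock, package = "psych")   # lsat6 from the psych package
fit.2pl <- ltm(lsat6 ~ z1)      # one latent trait, two parameters per item
coef(fit.2pl)                   # difficulty and discrimination estimates
plot(fit.2pl, type = "IIC")     # item information curves as in Figure 8.5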

8.1.6 Three parameter models

The original work on IRT was developed for items where there was no guessing, e.g., getting
the right answer when adding up a column of numbers (Lord, 1952). But, when taking
a multiple choice ability test with n alternatives it is possible to get some items correct by
guessing. Knowing nothing about the material one should be able to get at least 1/n% correct
by random guessing. But guessing is not, in fact, random. Items can differ in how much they
attract guesses for people who know very little about the material and thus a third item
parameter can be introduced, the guessing parameter, γ:
p(correct_{ij}|θ_i, α_j, δ_j, γ) = γ + (1 − γ) / (1 + e^{α_j(δ_j − θ_i)})  (8.27)

[Figure 8.4: 2PL models differing in their discrimination parameter (α = 2, 1, 0.5), plotting the probability of a correct response against ability in logit units.]

Fig. 8.4 By adding a discrimination parameter, the simple additivity of the 1PL model is lost. An
item can be both harder for those with low ability and easier for those with high ability (b=2) than
less discriminating items (b = 1, .5). Lines drawn with e.g., curve(logistic(x,b=2)).

(Figure 8.6 upper panel).


Unfortunately, the addition of the guessing parameter increases the likelihood that the
trace lines will intersect and thus increases the non-additivity of the item functioning.

8.1.7 Four parameter models

Some items are so difficult that even with extreme levels of a trait not everyone will respond
to the item. That is, the upper asymptote of the item is not 1. Although Reise and Waller
(2009) have shown that including both a lower (γ) and upper (ζ ) bound to the item response
will improve the fit of the model and can be argued for in clinical assessment of disorders
resulting in extremely rare behaviors, it would seem that the continued addition of
parameters at some point leads to an overly complicated (but well fitting) model (Figure 8.6
lower panel). The model is

[Figure 8.5: the two parameter solution for the lsat6 items Q1 ... Q5: Item Characteristic Curves, Item Information Curves, and a Kernel Density Estimate of the ability estimates, each plotted against Ability.]

Fig. 8.5 The two parameter solution to the lsat6 data set shows that the items differ somewhat in
their discriminability, and hence in the information they provide. Compare this to the one parameter
(Rasch) solution in Figure 8.1. The item information curve emphasizes that the items are maximally
informative at different parts of the difficulty dimension.

[Figure 8.6: upper panel, 3PL models differing in guessing and difficulty; lower panel, 4PL items differing in guessing, difficulty, and upper asymptote; the probability of a correct response is plotted against ability in logit units.]

Fig. 8.6 The 3PL model adds a lower asymptote (γ) to model the effect of guessing. The 4PL model
adds yet another parameter (ζ ) to reflect the tendency to never respond to an item. As the number
of parameters in the IRT model increases, the fit tends to be better, but at the cost of item by ability
interactions. It is more difficult to make inferences from an item response if the probability of passing
or endorsing an item is not an additive function of ability and item difficulty.

P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^{α_j(δ_j − θ_i)}).  (8.28)
When considering the advantages of these more complicated IRT models as compared to
classic test theory Reise and Waller (2009) note that only with proper IRT scoring can we
detect the large differences between individuals who differ only slightly on total score. This
is the case for extremely high or low trait scores that are not in the normal range. But this
is an argument in favor of all IRT models and merely implies that items should be “difficult”
enough for the levels of the trait of interest.
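The whole family of models in this section can be written as one function. The sketch below is an illustration of Equation 8.28 (not the psych logistic function used in Figure 8.4); setting ζ = 1 gives the 3PL, γ = 0 and ζ = 1 the 2PL, and additionally α = 1 the 1PL.

irf <- function(theta, alpha = 1, delta = 0, gamma = 0, zeta = 1)
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
curve(irf(x, alpha = 1, delta = -1, gamma = .2, zeta = .8), -4, 4,
      xlab = "Ability (logit units)", ylab = "P(correct)")      # a 4PL item
curve(irf(x, alpha = 2, delta = 0), -4, 4, add = TRUE, lty = 2)  # a discriminating 2PL item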

8.2 Polytomous items

Although ability items are usually scored right or wrong, personality items frequently have
multiple response categories with a natural order. In addition, there is information that
can be obtained from the pattern of incorrect responses in multiple choice ability items.
Techniques addressing these types of items were introduced by Samejima (1969) and discussed
in great detail by Ostini and Nering (2006); Reise et al. (1993); Samejima (1996); Thissen
and Steinberg (1986).

8.2.1 Ordered response categories

A typical personality item might ask “How much do you enjoy a lively party” with a five point
response scale ranging from “1: not at all” to “5: a great deal” with a neutral category at 3.
The assumption is that the more sociable one is, the higher the response alternative chosen.
The probability of endorsing a 1 will increase monotonically the less sociable one is, the
probability of endorsing a 5 will increase monotonically the more sociable one is. However,
to give a 2 will be a function of being above some threshold between 1 and 2 and below some
threshold between 2 and 3. A possible response model may be seen in Figure 8.7-left panel.
This graded response model is based upon the early work of Samejima (1969) who discussed
how an n alternative scale reflects n − 1 thresholds for normal IRT responses (Figure 8.7-right
panel). The probability of giving the lowest response is just the probability of not passing
the first threshold. The probability of the second response is the probability of passing the
first threshold but not the second one, etc. For the 1PL or 2PL logistic model the probability
of endorsing the kth response is a function of ability, item thresholds, and the discrimination
parameter and is
P(r = k|θ_i, δ_k, δ_{k−1}, α_k) = P(r|θ_i, δ_{k−1}, α_k) − P(r|θ_i, δ_k, α_k) = 1/(1 + e^{α_k(δ_{k−1} − θ_i)}) − 1/(1 + e^{α_k(δ_k − θ_i)})  (8.29)

where all α_k are set to α_k = 1 in the 1PL Rasch case.
Because the probability of a response in a particular category is the difference in proba-
bilities of responding to a lower and upper threshold, the graded response model is said to be
a difference model (Thissen and Steinberg, 1986, 1997) which is distinguished from a family
of models known as divide by total models. These models include the rating scale model in
which the probability of response is the ratio of two sums:

P(X_i = x|θ) = e^{∑_{s=1}^{x} (θ − δ_i − τ_s)} / ∑_{q=0}^{m} e^{∑_{s=1}^{q} (θ − δ_i − τ_s)}  (8.30)

where τs is the the difficulty associated with the s response alternatives and the total number
of alternatives to the item is m (Sijtsma and Hemker, 2000).
Estimation of the graded response model may be done with the grm function in the ltm
package and of the rating scale model with the RSM function in the eRm package. Consider the
Computer Anxiety INdex or CAIN inventory discussed by Bond and Fox (2007) and available
either from that text or an online repository at http://homes.jcu.edu.au/~edtgb/book/data/Cain.dat.txt.

[Figure 8.7: left panel, response probabilities for a five level response scale; right panel, the four underlying response functions with thresholds at -2, -1, 1, and 2; both plotted against the latent attribute in logit units.]

Fig. 8.7 The response probability for the five alternatives of ordered categories will be monotonically decreasing (for the lowest alternative), single peaked (for the middle alternatives), and monotonically increasing (for the highest alternative) (left panel). These five response patterns are thought to reflect four response functions (right panel). The difficulty thresholds are set arbitrarily to -2, -1, 1, and 2.

Some of the CAIN items as presented need to be reverse scored before
analysis can proceed. Doing so using the reverse.code function allows an analysis using the
grm function.

8.2.2 Multiple choice ability items

An analysis of the response probabilities for all the alternatives of a multiple choice test
allows for an examination of whether some distractors are behaving differently from others.
This is particularly useful when examining how each item works in an item bank. With the
very large samples available to professional testing organizations, this can be done empirically
without model fitting (Wainer, 1989). But with smaller samples this can be done using the
item parameters (Thissen et al., 1989). In Figure 8.8 distractor D1 seems to be equally
attractive across all levels of ability, D2 and D3 decrease in their probability of response as
ability increases, and distractor D4 appeals to people with slightly less than average ability.
The correct response, while reflecting guessing at the lowest levels of ability, increases in the
probability of responding as ability increases. This example, although hypothetical, is based
upon real items discussed by Thissen et al. (1989) and Wainer (1989).

Table 8.2 The grm function in the ltm package may be used to analyze the CAIN data set from
Bond and Fox (2007) using a graded response model. The data are made available as a supplement
to Bond and Fox (2007) and may be retrieved from the web at http://homes.jcu.edu.au/~edtgb/
book/data/Cain.dat.txt. Some of the items need to be reverse coded before starting. The model was
constrained so that all items had an equal discrimination parameter. The coef function extracts the
cutpoints from the analysis, the order function allows for sorting the items based upon their first
threshold and headtail reports the first and last four lines.

keys <- c(-1,-1,-1,-1,1,-1,1,-1,1,-1,1,1,-1,1,-1,-1,-1,1,1,1,1,1,1,1,1,-1)  #-1 marks an item to be reverse coded
rev.cain <- reverse.code(keys,cain,mini=rep(1,26),maxi=rep(6,26))  #reverse code the 6 point items
cain.grm <- grm(rev.cain,constrained=TRUE)       #constrained => equal discriminations across items
cain.coef <- coef(cain.grm)                      #extract the thresholds and the discrimination
sorted.coef <- cain.coef[order(cain.coef[,1]),]  #sort the items by their first threshold
headtail(sorted.coef)                            #show the first and last four items

Extrmt1 Extrmt2 Extrmt3 Extrmt4 Extrmt5 Dscrmn


Item 21 -5.02 -3.26 -2.71 -1.91 -0.42 1.15
Item 1 -4.96 -2.69 -1.93 -0.35 1.7 1.15
Item 13 -4.81 -3.91 -3.01 -1.99 -0.48 1.15
Item 9 -4.48 -3.57 -2.69 -2.13 -1.07 1.15
... ... ... ... ... ... ...
Item 14 -3.15 -2.3 -1.39 -0.52 0.89 1.15
Item 23 -2.94 -1.96 -1.1 -0.21 0.85 1.15
Item 15 -2.36 -1.43 -0.6 0.35 1.41 1.15
Item 25 -1.02 -0.12 0.58 1.17 1.69 1.15

8.3 IRT and factor analysis of items

At first glance, the concepts of factor analysis as discussed in Chapter 6 would seem to be very
different from the concepts of Item Response Theory. This is not the case. Both are models of
the latent variable(s) associated with observed responses. Consider first the case of one latent
variable. If the responses are dichotomous functions of the latent variable then the observed
correlation between any two responses (items) is a poor estimate of their underlying latent
correlation. The observed φ correlation is attenuated both because of the dichotomization,
but also if there are any differences in mean level for items. With the assumption of normally
distributed latent variables, the observed two x two pattern of responses may be used to
estimate the underlying, latent, correlations using the tetrachoric correlation (4.5.1.4).
If the correlations of all of the items reflect one underlying latent variable, then factor
analysis of the matrix of tetrachoric correlations should allow for the identification of the
regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from a factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function (McDonald, 1999; Lord and Novick, 1968; Takane and de Leeuw, 1987).

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{8.31}$$

[Figure 8.8 appears about here: "Multiple choice ability item", the probability of response for the correct alternative and the distractors D1 through D4 as a function of ability in logit units.]

Fig. 8.8 IRT analyses allow for the detection of poorly functioning distractors. Although the correct
response is monotonically increasing as a function of ability, and the probability of responding to
distractors D1...D3 decreases monotonically with ability, the distractor D4 (dotted line) seems to
appeal to people with slightly below average ability. This item should be examined more closely.

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, the factor loadings (λ_j) and item thresholds (τ_j) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
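As a minimal sketch of the conversion in Equation 8.31, consider a single hypothetical item with a factor loading of .45 and a normal threshold of .70 (values chosen only for illustration):

lambda <- .45                        #factor loading on the first factor (illustrative)
tau <- .70                           #threshold reported by the tetrachoric function (illustrative)
D <- 1.702                           #scaling factor for the logistic parameterization (1 for the normal ogive)
alpha <- lambda/sqrt(1 - lambda^2)   #item discrimination
delta <- D * tau/sqrt(1 - lambda^2)  #item difficulty
round(c(discrimination = alpha, difficulty = delta), 2)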

Consider the item data discussed in 6.6. These were generated for a normal model with
difficulties ranging from -2.5 to 2.5 and with equal slopes of 1. Applying the equations for
δ and α (8.31) to the τ and λ estimates from the tetrachoric and fa functions results in
difficulty estimates ranging from -2.45 to 2.31 that correlate > .995 with the theoretical values and with those found by either the ltm or rasch functions in the ltm package. Similarly, although the estimated slopes are not all identical to the correct value of 1.0, they have a mean of 1.04 and range from .9 to 1.18 (Table 8.3).

Table 8.3 Three different functions to estimate irt parameters. irt.fa factor analyzes the tetrachoric
correlation matrix of the items. ltm applies a two parameter model, as does rasch. The slope parameter
in the rasch solution is constrained to be equal to 1. Although the estimated values are different, the three estimates of the item difficulty correlate > |.995| with each other and with the actual item difficulties (a). The slope parameters in the population were all equal to 1.0, and the differences in the estimates of
slopes between the models reflect random error. Note that the ltm and rasch difficulties are reversed
in sign from the a values and irt.fa values.

set.seed(17)
items <- sim.npn(9,1000,low=-2.5,high=2.5)$items   #simulate 9 normal ogive items for 1000 subjects
p.fa <- irt.fa(items)$coefficients[1:2]            #difficulty and discrimination from factoring the tetrachorics
p.ltm <- ltm(items~z1)$coefficients                #two parameter logistic estimates
p.ra <- rasch(items, constraint = cbind(ncol(items) + 1, 1))$coefficients  #Rasch model with the slope fixed at 1
a <- seq(-2.5,2.5,5/8)                             #the difficulties used to generate the items
p.df <- data.frame(a,p.fa,p.ltm,p.ra)
round(p.df,2)

a Difficulty Discrimination X.Intercept. z1 beta.i beta


Item 1 -2.50 -2.45 1.03 5.42 2.61 3.64 1
Item 2 -1.88 -1.84 1.00 3.35 1.88 2.70 1
Item 3 -1.25 -1.22 1.04 2.09 1.77 1.73 1
Item 4 -0.62 -0.69 1.03 1.17 1.71 0.98 1
Item 5 0.00 -0.03 1.18 0.04 1.94 0.03 1
Item 6 0.62 0.63 1.05 -1.05 1.68 -0.88 1
Item 7 1.25 1.43 1.10 -2.47 1.90 -1.97 1
Item 8 1.88 1.85 1.01 -3.75 2.27 -2.71 1
Item 9 2.50 2.31 0.90 -5.03 2.31 -3.66 1

Simulations of IRT data with 1 to 4 parameter models and various distributions of the underlying latent trait may be done using the sim.irt function. The irt.fa function may be used to factor analyze dichotomous items (using tetrachoric correlations) and to express the results in terms of the IRT parameters of difficulty (δ) and discrimination (α). Plotting the subsequent results shows the idealized two parameter item characteristic curves (ICCs). In a very helpful
review of the equivalence of the Item Response and factor analytic approaches, Wirth and
Edwards (2007) review various problems of bias in estimating the correlation matrix and
hence the factor loadings and compare the results of various Confirmatory Categorical Factor
Analyses (CCFAs). Comparing their results to those of irt.fa suggests that the simple
approach works quite adequately.
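A minimal sketch of the sim.irt and irt.fa workflow just described (the seed, the number of items, and the range of difficulties are arbitrary; the $coefficients element follows the usage shown in Table 8.3):

library(psych)
set.seed(42)
sim <- sim.irt(9, 1000, low = -2.5, high = 2.5)   #simulate 9 two parameter items for 1000 subjects
fit <- irt.fa(sim$items)                          #factor the tetrachoric correlations, report IRT parameters
fit$coefficients                                  #difficulty and discrimination estimates
plot(fit)                                         #the idealized item characteristic curves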

8.4 Test bias and Differential Item Functioning

If items differ in their parameters between different groups this may be seen as test bias
or more generally as Differential Item Functioning (DIF ). (Test bias implies that the items
differentially favor one group over another, whereas DIF includes the case where an item is
either easier or harder, or more or less sensitive to the underlying trait for different groups).
Consider the case of sex differences in depression. Items measuring depression (e.g., “In the
past week I have felt downhearted or blue” or “In the past week I felt hopeless about the
future”) have roughly equal endorsement characteristics for males and females. But the item

“In the past week I have cried easily or felt like crying” has a much higher threshold for men
than for women (Schaeffer, 1988; Steinberg and Thissen, 2006). This example of using IRT to
detect DIF may be seen clearly in Figure 8.9. As pointed out by Steinberg and Thissen (2006)
“For many purposes, graphical displays of trace lines and/or their differences are easier to
interpret than the parameters themselves” (p 406).
There are a number of ways to test for DIF, including differences in item difficulty as
well as differences in item sensitivity. Of course, when using a one parameter model, only
differences in difficulty can be detected while the sort of difference discussed by Steinberg
and Thissen (2006) requires at least a two parameter model. In their development of items
for short scales to assess physical and emotional well-being, Lai et al. (2005) consider items that differ by more than .5 logistic units to be suspect.
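One rough way to screen items against that criterion, sketched below under the assumption that a dichotomous response matrix (items) and a grouping vector (group) are already available, is simply to fit a 2PL model separately to the reference and focal groups and flag items whose estimated difficulties differ by more than .5 logits. This is only an informal screen, not a substitute for formal DIF tests.

library(ltm)
ref.items <- items[group == "reference", ]      #responses of the reference group
foc.items <- items[group == "focal", ]          #responses of the focal group
fit.ref <- ltm(ref.items ~ z1)                  #2PL for the reference group
fit.foc <- ltm(foc.items ~ z1)                  #2PL for the focal group
dif <- coef(fit.ref)[, 1] - coef(fit.foc)[, 1]  #first column of coef() holds the item difficulties
dif[abs(dif) > .5]                              #items whose difficulties differ by more than .5 logits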
Sometimes tests will have items that have compensatory DIF parameters. That is, a test
that includes some items that are easier for one group (the reference group) than another
(the focal group), and then other items that are harder for the reference group than the focal
group. It is thus logically possible to have unbiased tests (no differential test functioning) made up of items that themselves show DIF (Raju et al., 1995).
Steinberg and Thissen (2006)
Reeve and Fayers (2005)

8.5 Non-monotone trace lines – the measurement of attitudes

8.5.1 Unfolding theory revisited

8.5.2 Political choice

Van Schuur and Kruijtbosch (1995) Chernyshenko et al. (2007) Borsboom and Mellenbergh
(2002)

8.6 IRT and adaptive testing

Whether comparing school children across grade levels or assessing the health of medical patients, the usefulness of IRT techniques becomes most clear. Classical test theory increases reliability by aggregating items. This has the unfortunate tendency of requiring many items that are either too easy or too difficult for any one person to answer, and of requiring tests that are much too long for practical purposes. The IRT alternative is to give items that are
maximally informative for the person at hand, rather than people in general. If items are
tailored to the test taker, then people are given items that they will endorse (pass) with a
50% likelihood. (The analogy to high jumping is most appropriate here: rather than ask an Olympic athlete to jump bars of .5, .6, ..., 1, ..., 1.5, ..., 2.35, 2.4, 2.45, and 2.5 meters, the first jump will typically be 2.2 or 2.3 meters. Similarly, for elementary school students, the jumps might be .2, ..., 1 meters, with no one being given a 2.4 meter bar.) Thus, the person's score is not how many items they passed, but rather a function of how difficult the items they passed or failed were.
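The logic of choosing the next item can be sketched in a few lines: for a 2PL item, the information at trait level θ is α²P(1 − P), which is largest when the person has about a 50% chance of passing the item. The item bank, slope, and trait estimate below are hypothetical values used only for illustration.

item.info <- function(theta, difficulty, a = 1) {   #2PL item information
  p <- 1/(1 + exp(-a * (theta - difficulty)))       #probability of passing each item
  a^2 * p * (1 - p)
}
bank <- seq(-3, 3, .5)                              #difficulties of a small hypothetical item bank
theta.hat <- 1.2                                    #current estimate of the person's ability
bank[which.max(item.info(theta.hat, bank))]         #administer the most informative item next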

[Figure 8.9 appears about here: "Differential Item Functioning", the probability of response as a function of the latent trait, with separate trace lines and score distributions for males and females.]

Fig. 8.9 When measuring depression, the item “In the past week I have cried easily or felt like crying” has a lower threshold (δ = 1.3) and is slightly more sensitive to the latent trait (α = .8) for women than it is for men (δ = 3.3, α = .6). The figure combines item characteristic curves (monotonically rising) with the distribution of participant scores (normal shaped curves). Data from Schaeffer (1988), figure adapted from Steinberg and Thissen (2006).

In an adaptive testing framework, the first few items that are given are of average difficulty.
People passing the majority of those items are then given harder items. If they pass those as
well, then they are given yet harder items. But, if they pass the first set and then fail most
of the items in the second set, the third set will be made up of items in between the first and
second set. The end result of this procedure is that everyone could end up with the same
overall passing rate (50%) but drastically different scores (Cella et al., 2007; Gershon, 2005).
Unfortunately, this efficiency of testing has a potential bias in the case of high stakes testing
(e.g., the Graduate Record Exam or a nursing qualification exam). People who are higher in anxiety tend to reduce their effort following failure, while those lower in anxiety increase their effort. Given that the optimal adaptive test will produce failure 50%
of the time, adaptive testing will lead to an underestimate of the ability of the more anxious
participants as they will tend to reduce their effort as the test proceeds (Gershon, 1992).

8.7 Item banking and item comparison

When doing adaptive testing or developing short forms of instruments that are optimized for
a particular target population, it is necessary to have a bank of items of various difficulties
in order to choose the most discriminating ones. This is not much different from the normal
process of scale development (to be discussed in Chapter 16) but focuses more upon choosing
items to represent the full range of item location (difficulty). Developing an item bank requires
understanding the construct of interest, writing items at multiple levels of difficulty and
then validating the items to make sure that they represent the relevant construct. Items are
chosen based upon their discriminability (fit with the dimension) and location (difficulty).
Additional items are then developed to fit into gaps along the difficulty dimension. Items
need to be shown to be equivalent across various sub groups and not to show differential
item functioning.
A very nice example of the use of item banking in applied testing of various aspects of
health is the “Patient-Reported Outcomes Measurement Information System” (PROMIS )
developed by the National Institutes of Health. Combining thousands of different items as-
sessing symptoms and outcomes as diverse as breathing difficulties associated with chronic
lung disease to pain symptoms to satisfaction with life, the PROMIS item bank is used for
Computerized Adaptive Testing (CAT), presenting much shorter and more precise questionnaires than would have been feasible using conventional tests (Cella et al., 2007).
Although most of the PROMIS work has been done using commercial software, more recent developments have used IRT functions in R. Linking solutions together across groups may be done using the plink package (Weeks, 2010).

8.8 Non-parametric IRT

Mokken and Lewis (1982) but see Roskam et al. (1986) and a response Mokken et al. (1986)

8.9 Classical versus IRT models – does it make a difference?

Perhaps the greatest contribution of IRT methods is that person scores are in the same
metric as the items. That is, rather than saying that someone is one standard deviation
higher than another, it is possible to express what that means in probability of response.
With items with a mean probability of .5, being one logit unit higher than the mean implies
a probability of correct response of .73, while being two logit units higher implies a probability
of correct response of .88. This is much more informative about the individual than saying
that someone is one or two standard deviations above the mean score which without knowing
the sample standard deviation tells us nothing about how likely the person is to get an item
correct. In addition, rather than saying that a test has a reliability of $r_{xx}$, and therefore that the standard error of measurement for any particular person's score is estimated to be $\sigma_x\sqrt{1-r_{xx}}$, it is possible to say that a particular score has an error of estimation of $1/\sqrt{\mathrm{information}}$. That is,
some scores can be measured with more precision than others. In addition, by recognizing
that the relationship between the underlying attribute and the observed score is monotonic

(in the case of ability) but not linear, it is less likely that errors of inference due to scaling
artifacts (e.g., 3.6) will be made.
In addition, to the practitioner the obvious advantage of IRT techniques is the possibility of item banking and adaptive testing (Cella et al., 2007). It makes little sense to give items that are much too easy or much too hard when measuring a particular attribute. The savings in tester and testee time are an important benefit of adaptive testing.
Why then do people still use classical techniques? To some dedicated IRT developers,
using classical test theory is an anachronism that will fade with better training in measure-
ment (Embretson, 1996). But to others, the benefits of combining multivariate models such
as factor analysis or structural equation modeling with a theory of tests that is just a dif-
ferent parameterization of IRT values outweighs the benefits of IRT (McDonald, 1999). The
supposed clear superiority of IRT is seen as an “urban legend” and rather IRT and CTT
techniques each have their advantage (Zickar and Broadfoot, 2009). Because the correlation
between CTT scores and IRT based scores tends to be greater than .97 for normal samples (Fan, 1998), the simple addition of items following the principles of CTT seems quite adequate. In a very balanced discussion of how to construct quality of life inventories, Reeve and Fayers (2005) consider the benefits of both CTT and IRT. The great benefit of IRT techniques is the emphasis on the characteristics of the items (e.g., in identifying DIF) and the ability to select
items from item banks to optimally measure any particular level of a trait.
The graphical displays of item information as a function of item location and item discrim-
ination make it easier to see and explain why some items were chosen and how and why to
choose items ranging in their difficulty/location. The appeal of multi-parameter IRT models
to the theorist concerned with best fitting complex data sets with elegant models needs to
be contrasted with the simple addition of item responses for the practitioner. The use of
IRT for selecting items from a larger item pool is a clear advantage for the test developer,
but of less concern for the user who just wants to rank order individuals on some construct
and have an estimate of the confidence interval for any particular score. But for the theorist
or the practitioner, R includes multiple packages that provide powerful alternatives to the
commercial IRT programs.
Chapter 9
Validity

Reliability is how compact a flight of arrows is; validity is whether you hit the target. Oxygen titrations can be reliable, but if the chemist is color blind, they are not valid measures of the oxygen level when compared to someone else's measures.

9.1 Types of validity

Does a test measure what it is supposed to measure? How do we know? Shadish et al. (2001),
Borsboom and Mellenbergh (2004)

9.1.1 Face or Faith

Does the content appear reasonable? Is this important?

9.1.2 Concurrent and Predictive

Do tests correlate with alternative measures of the same construct right now and do they
allow future predictions?

9.1.3 Construct

What is the location of our measure in our nomological network? Cronbach and Meehl (1955)

9.1.3.1 Convergent

Do measures correlate with what they should correlate with given the theory?


9.1.3.2 Discriminant

Do measures not correlate with what the theory says they should not correlate with? What
is the meaning of a hyperplane? Measuring what something isn’t is just as important as
knowing what something is.

9.1.3.3 Incremental

Does it make any difference if we add a test to a battery? Werner Wittmann and the principle
of Brunswickian Symmetry Wittmann and Matt (1986) Also, the notion of higher order versus
lower order predictions.

9.2 Validity and cross validation

9.2.1 Cross validation

9.2.2 Resampling and cross validation

Grucza and Goldberg (2007)

9.3 Validity: a modern perspective

Borsboom and Mellenbergh (2004) The ontology vs. epistemology of validity.

9.4 Validity for what?

9.4.1 Validity for the institution

9.4.2 Validity for the individual

9.5 Validity and decision making

Wiggins (1973)
Chapter 10
Reliability + Validity = Structural Equation
Models

10.1 Generating simulated data structures

10.2 Measures of fit

As has been seen in the previous sections, the use of fit statistics does not guarantee mean-
ingful models. If we do not specify the model correctly, either because we do not include the
correct variables or because we fail to use the appropriate measurement model, we will be led to incorrect conclusions. Widaman and Thompson (2003) MacCallum et al. (2006) Marsh
et al. (2005)
Even if we have a very good fit, we are unable to determine causal structure from the
model, even if we bother to add time into the model.

10.2.1 χ²

As we saw in the previous chapter, χ² is very sensitive to many sources of error in our model specification. χ² is sensitive to failures of our distributional assumptions (continuous,
multivariate normal) as well as to our failures to correctly specify the structure.


10.2.2 GFI, NFI, ...

10.2.3 RMSEA

10.3 Reliability (Measurement) models

10.3.1 One factor — congeneric measurement model

10.3.1.1 Generating congeneric data structures

10.3.1.2 Testing for Tau equivalent and congeneric structures

10.3.2 Two (perhaps correlated) factors

10.3.2.1 Generating multiple factorial data

10.3.2.2 Confirmatory factoring using sem

10.3.3 Hierarchical measurement models

10.3.3.1 Generating the data for three correlated factors

10.3.3.2 Testing hierarchical models

10.4 Reliability + Validity = Structural Equation Models

10.4.1 Factorial invariance

10.4.2 Multiple group models

10.5 Evaluating goodness of fit

10.5.1 Model misspecification: Item quality

10.5.1.1 Continuous, ordinal, and dichotomous data

Most advice on the use of latent variable models discusses the assumption of multivariate
normality in the data. Further discussions include the need for continuous measures of the
observed variables. But how does this relate to the frequent use of SEM techniques in analysis
of personality or social psychological items rather than scales? In this chapter we consider
typical problems in personality where we are interested in the structure of self reports of
personality, emotion, or attitude. Using simulation techniques, we consider the effects of

normally distributed items, ordinal items with 6 or 4 or 2 levels, and then the effect of skew
on these results. For a discussion of real
data with some of these problems, see Rafaeli and Revelle (2006).

10.5.1.2 Simple structure versus circumplex structure

Most personality scales are created to have “simple structure” in which items load on one and only one factor (Revelle and Rocklin, 1979; Thurstone, 1947). The conventional estimate of the reliability and general factor saturation of such a test is Cronbach's coefficient α (Cronbach, 1951). Variations of this model include hierarchical structures where all items load on a general factor, g, and then groups of items load on separate group factors (Carroll, 1993; Jensen and Weng, 1994). Estimates of the amount of general factor saturation for such hierarchical structures may be found using the ω coefficient discussed by McDonald (1999) and Zinbarg et al. (2005).
An alternative structure, particularly popular in the study of affect as well as studies
of interpersonal behavior is a “circumplex structure” where items are thought to be more
complex and to load on at most two factors.
“A number of elementary requirements can be teased out of the idea of circumplex structure.
First, circumplex structure implies minimally that variables are interrelated; random noise does
not a circumplex make. Second, circumplex structure implies that the domain in question is
optimally represented by two and only two dimensions. Third, circumplex structure implies that
variables do not group or clump along the two axes, as in simple structure, but rather that there
are always interstitial variables between any orthogonal pair of axes Saucier (1992). In the ideal
case, this quality will be reflected in equal spacing of variables along the circumference of the
circle Gurtman (1994)(Gurtman, 1994; Wiggins, Steiger, & Gaelick, 1981). Fourth, circumplex
structure implies that variables have a constant radius from the center of the circle, which
implies that all variables have equal communality on the two circumplex dimensions (Fisher,
1997; Gurtman, 1994). Fifth, circumplex structure implies that all rotations are equally good
representations of the domain (Conte & Plutchik, 1981; Larsen & Diener, 1992).” (Acton and
Revelle, 2004).

Variations of this model in personality assessment include the case where items load on two
factors but the entire space is made up of more factors. The Abridged Big Five Circumplex
Structure (AB5C) of Hofstee et al. (1992b) is an example of such a structure. That is, the
AB5C items are of complexity one or two but are embedded in a five dimensional space.

10.5.2 Model misspecification: failure to include variables

10.5.3 Model misspecification: incorrect structure

10.6 What does it mean to fit a model


Appendix A
R: Getting started

There are many possible statistical programs that can be used in psychological research.
They differ in multiple ways, at least some of which are ease of use, generality, and cost. Some
of the more common packages used are SAS, SPSS, and Systat. These programs have GUIs
(Graphical User Interfaces) that are relatively easy to use but that are unique to each package.
These programs are also very expensive and limited in what they can do. Although convenient
to use, GUI based operations are difficult to discuss in written form. When teaching statistics
or communicating results, it is helpful to use examples that others may use, perhaps in
other computing environments. This book, as well as other texts in the Using R series,
describes an alternative approach that is widely used by practicing statisticians, the statistical
environment R. This appendix is not meant as a complete user’s guide to R, but merely the
first step in using R for psychometrics in particular and psychological research in general.
Throughout the text, examples of analyses are given in R. But what is R and how to get
it to work on your computer is perhaps the first question the reader faces.

A.1 R: A statistical programming environment

The R Development Core Team (2012) has developed an extremely powerful “language and
environment for statistical computing and graphics” and a set of packages that operate within
this programming environment (R). The R program is an open source version of the statistical
program S and is very similar to the statistical program based upon S, S-PLUS (also known
as S+). Although described as merely “an effective data handling and storage facility [with]
a suite of operators for calculations on arrays, in particular, matrices” R is, in fact, a very
useful interactive package for data analysis. When compared to most other stats packages
used by psychologists, R has at least three compelling advantages: it is free, it runs on multiple
platforms (e.g., Windows, Unix, Linux, and Mac OS X and Classic), and combines many of
the most useful statistical programs into one quasi integrated environment. R is free1 , open
source software as part of the GNU2 Project. That is, users are free to use, modify, and
distribute the program (within the limits of the GNU non-license). The program itself and

1 Free as in speech rather than as in beer. See http://www.gnu.org


2 GNU’s Not Unix


detailed installation instructions for Linux, Unix, Windows, and Macs are available through
CRAN (Comprehensive R Archive Network) at http://www.r-project.org3
Although many run R as a language and text oriented programming environment, there
are GUIs available for PCs, Linux and Macs. See for example, R Commander by John Fox
or R-app for the Macintosh developed by Stefano Iacus and Simon Urbanek. Compared to
the basic PC environment, the Mac GUI is to be preferred.
R is an integrated, interactive environment for data manipulation and analysis that in-
cludes functions for standard descriptive statistics (means, variances, ranges) and also in-
cludes useful graphical tools for Exploratory Data Analysis. In terms of inferential statistics
R has many varieties of the General Linear Model including the conventional special cases of
Analysis of Variance, MANOVA, and linear regression. Statisticians and statistically minded
people around the world have contributed packages to the R Group and maintain a very active
news group offering suggestions and help. The growing collection of packages and the ease
with which they interact with each other and the core R is perhaps the greatest advantage of
R. Advanced features include correlational packages for multivariate analyses including Fac-
tor and Principal Components Analysis, and cluster analysis. Advanced multivariate analyses
packages that have been contributed to the R-project include one for Structural Equation
Modeling (sem), Multi-level modeling (also known as Hierarchical Linear Modeling and re-
ferred to as non linear mixed effects in the nlme4 package) and taxometric analysis. All of
these are available in the free packages distributed by the R group at CRAN. Many of the
functions described in this book are incorporated into the psych package. Other packages
useful for psychometrics are described in a task-view at CRAN. In addition to being an environment of prepackaged routines, R is an interpreted programming language that allows one to create specific functions when needed.
R is also an amazing program for producing statistical graphics. A collection of some of
the best graphics is available at the webpage http://addictedtor.free.fr/graphiques/
with a complete gallery of thumbnails of the figures.

A.2 General comments

R is not overly user friendly (at first). Its error messages are at best cryptic. It is, however,
very powerful and once partially mastered, easy to use. As additional packages are added, it
becomes even more useful. Packages available at CRAN may be found and downloaded by
using the package manager command (or menu option for the Mac R-Gui). To download
a number of relevant packages all at once, the install.views function in the ctv package will install the packages recommended for particular applications (e.g., Psychometrics, SocialSciences, etc.).
Commands may be entered directly into the “Console” window and executed immediately.
In this regard, R can be thought of as a very advanced graphing calculator. Alternatively,
the Mac and PC versions have a text editor window that allows you to write, edit and
save your commands. For all systems, if you use a normal text editor (As a Mac user, I
use BBEDIT, PC users can use Notepad or TINN-R, many users prefer the ESS – emacs

3 The R Development Core Team (2012) releases an updated version of R about every six months.
That is, as of March, 2012, the current version of 2.14.2 will be replaced with 2.15.0 sometime in
March of 2012. Bug fixes are then added with a sub version number (e.g. 2.12.2 fixed minor problems
with 2.12.1). It is recommended to use the most up to date version, as it will incorporate various
improvements and operating efficiencies.

speaks statistics– editor), you can write out the commands you want to run, comment them
so that you can remember what they do the next time you run a similar analysis, and then
copy and paste them into the R console. You can add a comment to any line by using a #.
Anything following the # is printed but ignored for execution. This allows you to document
your commands as you write them. The history window keeps track of recent commands.
The R code throughout this text is meant to be copied and pasted into R.
Although being syntax driven seems a throwback to an old, pre Graphical User Interface
type command structure, it is very powerful for doing production statistics. Once you get a
particular set of commands to work on one data file, you can change the name of the data file
and run the entire sequence again on the new data set. This is also very helpful when doing
professional graphics for papers. In addition, for teaching, it is possible to prepare a web page
of instructional commands that students can then cut and paste into R to see for themselves
how things work. That is what may be done with the instructions throughout this book.
It is also possible to write text in LaTeX with embedded R commands. Then executing the Sweave function on that text file will add the R output to the LaTeX file. This almost magical
feature allows rapid integration of content with statistical techniques. More importantly, it
allows for literate programming and reproducible research (Leisch and Rossini, 2003) in that
the actual data files and instructions may be specified for all to see.
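As a minimal sketch, a Sweave file (conventionally with the extension .Rnw) is ordinary LaTeX with R code chunks delimited by <<>>= and @, and with inline results inserted by \Sexpr{}; the chunk name and the use of the person.data example from this appendix are merely illustrative.

\documentclass{article}
\begin{document}
The mean extraversion score was \Sexpr{round(mean(person.data$epiE),2)}.
<<descriptives,echo=TRUE>>=
library(psych)
describe(person.data)
@
\end{document}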

A.3 Using R in 12 simple steps

(These steps are not meant to limit what can be done with R, but merely to describe how to
do the analysis for the most basic of research projects and to give a first experience with R; a minimal sketch of these steps in code follows the list.)
1. Install R on your computer or go to a machine that has it.
2. Download the psych package as well as other recommended packages from CRAN using
the install.packages function, or using the package installer in the GUI. To get packages
recommended for a particular research field, use the ctv package to install a particular task
view. Note that these first two steps need to be done only once!
3. Activate the psych package or other desired packages using e.g., library(psych). This
needs to be done every time you start R. Or, it is possible to modify the startup parameters
for R so that certain libraries are loaded automatically.
4. Enter your data using a text editor and save as a text file (perhaps comma delimited if
using a spreadsheet program such as Excel or OpenOffice).
5. Read the data file or copy and paste from the clipboard (using, e.g., read.clipboard).
6. Find basic descriptive statistics (e.g., means, standard deviations, minimum and maxima)
using describe.
7. Prepare a simple descriptive graph (e.g, a box plot) of your variables.
8. Find the correlation matrix to give an overview of relationships (if the number is not
too great, a scatter plot matrix or SPLOM plot is very useful; this can be done with pairs.panels).
9. If you have an experimental variable, do the appropriate multiple regression using stan-
dardized or at least zero centered scores.
10. If you want to do a factor analysis or principal components analysis, use the factanal or
fa and principal functions.

11. To score items and create a scale and find various reliability estimates, use score.items
and perhaps omega.
12. Graph the results.
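As a minimal sketch of these steps (the file name my.data.txt is hypothetical; any tab delimited file of numeric variables with names in the first row would do):

install.packages("psych")                         #step 2, needed only once
library(psych)                                    #step 3, needed every session
my.data <- read.table("my.data.txt",header=TRUE)  #steps 4 and 5, read the data
describe(my.data)                                 #step 6, basic descriptive statistics
boxplot(my.data)                                  #step 7, a simple descriptive graph
pairs.panels(my.data)                             #step 8, correlations and a SPLOM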

A.4 Getting started

A.4.1 Installing R on your computer

Although it is possible that your local computer lab already has R, it is most useful to do
analyses on your own machine. In this case you will need to download the R program from
the R project and install it yourself. Using your favorite web browser, go to the R home page
at http://www.r-project.org and then choose the Download from CRAN (Comprehensive
R Archive Network) option. This will take you to a list of mirror sites around the world. You
may download the Windows, Linux, or Mac versions at this site. For most users, downloading
the binary image is easiest and does not require compiling the program. Once downloaded,
go through the install options for the program. If you want to use R as a visitor it is possible
to install R onto a “thumb drive” or “memory stick” and run it from there. (See the R for
Windows FAQ at CRAN).

A.4.2 Packages and Task Views

One of the great strengths of R is that it can be supplemented with additional programs that
are included as packages using the package manager (e.g., sem or OpenMx do structural
equation modeling) or that can be added using the source command. Most packages are
directly available through the CRAN repository. Others are available at the BioConductor
(http://www.bioconductor.org) repository. Yet others are available at “other” repositories.
The psych package (Revelle, 2012) may be downloaded from CRAN or from the http:
//personality-project.org/r repository.
The concept of a “task view” has made downloading relevant packages very easy. For
instance, the install.views("Psychometrics") command will download over 20 packages
that do various types of psychometrics. To install the Psychometrics task view:4
> install.packages("ctv")
> library(ctv)
> install.views("Psychometrics")
For any package other than the default packages to work, you must activate it by either using the
Package Manager or the library command:
• e.g., library(psych) or library(sem)
• entering ?psych will give a list of the functions available in the psych package as well as
an overview of their functionality.

4 Here, as throughout the book, the “>” represents the R system prompt to enter a new line. Do not
enter it, but rather just enter the text following the prompt.

• objects("package:psych") will list the functions available in a package (in this case,
psych).
> library(psych)
> ?psych
> objects("package:psych")
If you routinely find yourself using the same packages every time you use R, you can modify the startup process by specifying what should happen in the .First function. Thus, if you always want to have psych available,
.First <- function() {library(psych)}
and then when you quit, use the save workspace option.

A.4.3 Help and Guidance

R is case sensitive and does not give overly useful diagnostic messages. If you get an error
message, don’t be flustered but rather be patient and try the command again using the
correct spelling for the command.
When in doubt, use the help(somefunction) function. This is identical to ?somefunction, where somefunction is what you want to know about, e.g.,
> ?read.table #ask for help in using the read.table function the answer is in the help window
> help(read.table) #another way of asking for help
> ??read #searches for all uses of the function in all your packages.
> apropos("read") #returns all available functions with that term in their name
> RSiteSearch("read") #opens a webbrowser and searches voluminous files
RSiteSearch(“keyword”) will open a browser window and return a search for “keyword” in
all functions available in R and the associated packages, as well as (if desired) in the R-Help News groups.
All packages and all functions will have an associated help window. Each help window will
give a brief description of the function, how to call it, the definition of all of the available
parameters, a list (and definition) of the possible output, and usually some useful examples.
One can learn a great deal by using the help windows, but if they are available, it is better
to study the package vignette.

A.4.4 Package vignettes

All packages have help pages for each function in the package. These are meant to help you
use a function that you already know about, but not to introduce you to new functions. An
increasing number of packages have package vignettes that give more of an overview of the package than a detailed description of any one function. These vignettes are accessible from
the help window and sometimes as part of the help index for the program. The two vignettes
for the psych package are also available from the personality project web page. (An overview
of the psych package and Using the psych package as a front end to the sem package).

A.5 Basic R commands and syntax

A.5.1 R is just a fancy calculator

One can think of R as a fancy graphics calculator. Enter a command and look at the output.
Thus,
> 2 + 2 #returns the output
4
or, somewhat more fun, try graphing a function or two:
curve(sin, -2*pi, 2*pi)
curve(tan, main = "curve(tan) --> same x-scale as previous plot")
At the somewhat more abstract level, almost all operations in R consist of executing a
function on an object. The result is a new object. This very simple idea allows the output of
any operation to be operated on by another function.
Command syntax tends to be of the form:
variable = function (parameters) or
variable <- function (parameters)
The = and the <- symbol imply replacement, not equality. The preferred style is to use the
<- symbol to avoid confusion with the test for equality (==).
The result of an operation will not necessarily appear unless you ask for it. The command
m <- mean(x)
will find the mean of x but will not print anything on the console without the additional
request
m
However, just asking for
mean(x)
will find the mean and print it.

A.5.1.1 R is also a statistics table

It has been suggested by some that you should never buy a statistics book that has probability
tables in it, because that means that the author did not know about the various distributions
in R. Many statistics books include tables of the t or F or c 2 distribution. By using R this is
unnecessary since these and many more distributions can be obtained directly. Consider the
normal distribution as an example. dnorm(x, mean=mu, sd=sigma) will give the probability
density of observing that x in a distribution with mean=mu and standard deviation= sigma.
pnorm(q,mean=0,sd=1) will give the probability of observing the value q or less. qnorm(p,
mean=0, sd=1) will give the quantile value of a value with probability p. rnorm(n,mean,sd)
will generate n random observations sampled from the normal distribution with specified
mean and standard deviation. Thus, to find out what z value has a .05 probability we ask for
qnorm(.05). Or, to evaluate the probability of observing a z value of 2.5, specify pnorm(2.5).
(These last two examples are one-sided p values.)
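For example (the values in the comments are rounded):

> qnorm(.05)    #the z value with .05 of the normal distribution below it (-1.64)
> pnorm(2.5)    #the probability of observing a z value of 2.5 or less (.99)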

A.5.2 Data structures

Data may be composed of


• Single elements which may be of type integer, real, complex, or character. Thus, i <- 1,
x <- 2.34, name <- "bill" are possible elements.
• Vectors and lists are collections of elements that can be formed by combining individual elements, v <- c(i,x,name), or by providing a rule, y <- 10:20 (which creates a vector with elements [10, 11, ..., 20]).
A suggestion about programming style that is not strictly R related: It is useful to label
variables with names that will make sense to you when you look at an analysis several
months later. Thus, rather than calling variables x, y, and z, giving names that reflect that
they are in fact impulsivity, anxiety, and performance tends to be more useful.
For a more complete list of R commands, see Appendix B. A limited number of examples
are shown below. Examples of using R to do simple matrix operations are discussed in
Appendix E.

A.6 Entering or getting the data

For most data analysis, rather than manually enter the data into R, it is probably more
convenient to use a spreadsheet (e.g., Excel or OpenOffice) as a data editor, save as a tab
or comma delimited file, and then read the data from the file. Many of the examples in this
tutorial assume that the data have been entered this way. Many of the examples in the help
menus have small data sets entered using the c() command or created on the fly. It is also
possible to read data in from a remote file server.
Using the read.clipboard() function from the psych package, it is also possible to have a
data file open in a text editor or spreadsheet program, copy the relevant lines to the clipboard,
and then read the clipboard directly into R.
For the first example, we read data from a remote file server for several hundred subjects
on 13 personality scales (5 from the Eysenck Personality Inventory (EPI), 5 from a Big Five
Inventory (BFI), 1 Beck Depression measure, and two anxiety scales). The data are taken from a study
in the Personality, Motivation, and Cognition Laboratory at Northwestern University. The
file is structured normally, i.e., rows represent different subjects, columns different variables, and the first row gives the variable labels. Had we saved this file as comma delimited, we would
add the separation (sep=”,”) parameter.
To read a file from your local machine, change the datafilename to specify the path to the
data. Using the file.choose command, you can dynamically set the data file name to a file anywhere on your computer.
#specify the name and address of the remote file
>datafilename <- "http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data"
#Or, If I want to read a datafile from my desktop
#datafilename <- file.choose() #where you dynamically can go find the file

#now read the data file


>person.data <- read.table(datafilename,header=TRUE) #read the data file

>names(person.data) #list the names of the variables



[1] "epiE" "epiS" "epiImp" "epilie" "epiNeur" "bfagree" "bfcon"
[8] "bfext" "bfneur" "bfopen" "bdi" "traitanx" "stateanx"

The data are now in the data.frame “person.data”. Data.frames allow one to have columns
that are either numeric or alphanumeric. They are conceptually a generalization of a matrix
in that they have rows and columns, but unlike a matrix, some columns can be of different
“types” (integers, reals, characters, strings) than other columns.

A.7 Basic descriptive statistics

Basic descriptive statistics are most easily reported by using the summary, mean, and standard deviation (sd) commands. Using the describe function available in the psych package is
also convenient. Graphical displays that also capture this are available as a boxplot.

> summary(person.data) #print out the min, max, range, mean, median, etc. of the data
> round(mean(person.data),2) #means of all variables, rounded to 2 decimals
> round(sd(person.data),2) #standard deviations, rounded to 2 decimals
epiE epiS epiImp epilie epiNeur
Min. : 1.00 Min. : 0.000 Min. :0.000 Min. :0.000 Min. : 0.00
1st Qu.:11.00 1st Qu.: 6.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.: 7.00
Median :14.00 Median : 8.000 Median :4.000 Median :2.000 Median :10.00
Mean :13.33 Mean : 7.584 Mean :4.368 Mean :2.377 Mean :10.41
3rd Qu.:16.00 3rd Qu.: 9.500 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:14.00
Max. :22.00 Max. :13.000 Max. :9.000 Max. :7.000 Max. :23.00
bfagree bfcon bfext bfneur bfopen
Min. : 74.0 Min. : 53.0 Min. : 8.0 Min. : 34.00 Min. : 73.0
1st Qu.:112.0 1st Qu.: 99.0 1st Qu.: 87.5 1st Qu.: 70.00 1st Qu.:110.0
Median :126.0 Median :114.0 Median :104.0 Median : 90.00 Median :125.0
Mean :125.0 Mean :113.3 Mean :102.2 Mean : 87.97 Mean :123.4
3rd Qu.:136.5 3rd Qu.:128.5 3rd Qu.:118.0 3rd Qu.:104.00 3rd Qu.:136.5
Max. :167.0 Max. :178.0 Max. :168.0 Max. :152.00 Max. :173.0
bdi traitanx stateanx
Min. : 0.000 Min. :22.00 Min. :21.00
1st Qu.: 3.000 1st Qu.:32.00 1st Qu.:32.00
Median : 6.000 Median :38.00 Median :38.00
Mean : 6.779 Mean :39.01 Mean :39.85
3rd Qu.: 9.000 3rd Qu.:44.00 3rd Qu.:46.50
Max. :27.000 Max. :71.00 Max. :79.00

epiE epiS epiImp epilie epiNeur bfagree bfcon bfext bfneur


13.33 7.58 4.37 2.38 10.41 125.00 113.25 102.18 87.97

bfopen bdi traitanx stateanx


123.43 6.78 39.01 39.85

epiE epiS epiImp epilie epiNeur bfagree bfcon bfext bfneur


4.14 2.69 1.88 1.50 4.90 18.14 21.88 26.45 23.34
bfopen bdi traitanx stateanx
20.51 5.78 9.52 11.48

A.7.1 Using functions in the psych package

The psych package has been developed particularly for simple psychometrics and exploratory analysis of psychological data. It may be downloaded using the package installer from CRAN or from the personality project repository at http://personality-project.org/r.
Once downloaded and installed, it needs to be loaded before it can be used. The library
command does this.
Among the functions within the psych package are describe and pairs.panels.
> library(psych)
> describe(person.data)
var n mean sd median min max range se
epiE 1 231 13.33 4.14 14 1 22 21 0.27
epiS 2 231 7.58 2.69 8 0 13 13 0.18
epiImp 3 231 4.37 1.88 4 0 9 9 0.12
epilie 4 231 2.38 1.50 2 0 7 7 0.10
epiNeur 5 231 10.41 4.90 10 0 23 23 0.32
bfagree 6 231 125.00 18.14 126 74 167 93 1.19
bfcon 7 231 113.25 21.88 114 53 178 125 1.44
bfext 8 231 102.18 26.45 104 8 168 160 1.74
bfneur 9 231 87.97 23.34 90 34 152 118 1.54
bfopen 10 231 123.43 20.51 125 73 173 100 1.35
bdi 11 231 6.78 5.78 6 0 27 27 0.38
traitanx 12 231 39.01 9.52 38 22 71 49 0.63
stateanx 13 231 39.85 11.48 38 21 79 58 0.76

The describe function can be combined with the by function to provide even more detailed
tables. This example reports descriptive statistics for subjects with lie scores < 3 and those
>= 3. The second element in the by command could be a categorical variable (e.g., sex).
by(person.data,epilie<3,describe)

epilie < 3: FALSE

var n mean sd median min max range se


epiE 1 90 12.64 4.00 13.0 1 21 20 0.42

epiS 2 90 7.61 2.81 8.0 0 13 13 0.30


epiImp 3 90 3.97 1.67 4.0 1 8 7 0.18
epilie 4 90 3.89 1.08 4.0 3 7 4 0.11
epiNeur 5 90 9.33 5.20 9.0 0 20 20 0.55
bfagree 6 90 128.12 16.55 129.0 87 167 80 1.74
bfcon 7 90 117.56 20.46 118.0 58 178 120 2.16
bfext 8 90 100.88 25.24 101.0 24 151 127 2.66
bfneur 9 90 82.22 22.80 81.5 35 144 109 2.40
bfopen 10 90 121.97 20.55 121.0 75 172 97 2.17
bdi 11 90 5.77 4.71 5.0 0 24 24 0.50
traitanx 12 90 37.01 9.06 36.0 22 71 49 0.95
stateanx 13 90 38.41 11.36 36.5 21 69 48 1.20
---------------------------------------------------------------
epilie < 3: TRUE
var n mean sd median min max range se
epiE 1 141 13.77 4.17 14 4 22 18 0.35
epiS 2 141 7.57 2.62 8 1 13 12 0.22
epiImp 3 141 4.62 1.97 5 0 9 9 0.17
epilie 4 141 1.41 0.73 2 0 2 2 0.06
epiNeur 5 141 11.10 4.59 10 0 23 23 0.39
bfagree 6 141 123.00 18.88 124 74 165 91 1.59
bfcon 7 141 110.50 22.38 111 53 176 123 1.88
bfext 8 141 103.01 27.25 105 8 168 160 2.29
bfneur 9 141 91.64 23.01 94 34 152 118 1.94
bfopen 10 141 124.36 20.50 126 73 173 100 1.73
bdi 11 141 7.43 6.29 6 0 27 27 0.53
traitanx 12 141 40.28 9.62 39 23 71 48 0.81
stateanx 13 141 40.77 11.51 39 23 79 56 0.97

A.8 Simple Graphics

There are a variety of ways of graphically reporting the data. One is the box plot (boxplot)
to show the Tukey 5 numbers (minimum, lower hinge, median, upper hinge, maximum).
Another way to grasp the distribution of the data is to overlay the actual data points with
a stripchart.

boxplot(person.data[,1:5])
stripchart(person.data[,1:5],vertical=T,add=T,method="jitter",jitter=.2) #add in the points
Another way of describing the data is to graph them: boxplots show the medians and the upper and lower “hinges” (approximately the quartiles), while histograms show the distribution in more detail. The
pairs.panels command draws a matrix of scatter plots. (Note that just the first five variables
are shown in the SPLOM to make it more readable).
pairs.panels(person.data[,1:5])

[Figure A.1 appears about here: boxplots with overlaid stripcharts for epiE, epiS, epiImp, epilie, and epiNeur.]

Fig. A.1 A boxplot with an added stripchart summarizes basic distributional properties of the data.

[Figure A.2 appears about here: a scatter plot matrix (pairs.panels) of epiE, epiS, epiImp, epilie, and epiNeur, with histograms on the diagonal and correlations in the upper panels.]

Fig. A.2 A scatter plot matrix of the data can be modified to give histograms as well as the correla-
tions.

Appendix B
R commands

(A very rough summary of the most useful commands)

B.1 Input and display

#read files with labels in first row


read.table(filename,header=TRUE) #read a tab or space delimited file
read.table(filename,header=TRUE,sep=’,’) #read csv files (comma separated)

x=c(1,2,4,8,16 ) #create a data vector with specified elements


y=c(1:8,1:4) #create a data vector with 12 entries
matr=rbind(1:8,1:4) #create two rows in a 2 * 8 matrix
matc=cbind(1:8,1:4) #create two columns in a 8 * 2 matrix
n=10
x1=c(rnorm(n)) #create a n item vector of random normal deviates
y1=c(runif(n))+n #create another n item vector that has n added to each random uniform deviate
z=rbinom(n,size,prob) #create n samples of size "size" with probability prob from the binomial distribution
sample(x, size, replace = FALSE, prob = NULL) #take a sample (with or without replacement) of size from x

vect=c(x,y) #combine them into one vector of length 2n


mat=cbind(x,y) #combine them into a n x 2 matrix (column wise)
mat[4,2] #display the 4th row and the 2nd column
mat[3,] #display the 3rd row
mat[,2] #display the 2nd column
mat=cbind(rep(1:4,2),rep(4:1,2)) #create a 8 * 2 matrix with repeating elements
subset(data,logical) #those objects meeting a logical criterion
subset(data.df,select=variables,logical) #get those objects from a data frame that meet a criterion


B.2 moving around

ls() #list the variables in the workspace


rm(x) #remove x from the workspace
rm(list=ls()) #remove all the variables from the workspace
attach(mat) #make the names of the variables in the matrix available
detach(mat) #releases the names
new=old[,-n] #drop the nth column
new=old[-n,] #drop the nth row
new=subset(old,logical) #select those cases that meet the logical condition
complete = subset(data,complete.cases(data)) #find those cases with no missing values
new=old[n1:n2,n3:n4] #select the n1 through n2 rows of variables n3 through n4)

B.3 data manipulation

x.df=data.frame(x1,x2,x3 ...) #combine different kinds of data into a data frame


as.data.frame()
is.data.frame()
x=as.matrix()
scale() #converts a data frame to standardized scores
factor() #converts a numeric variable into a factor (essential for ANOVA)
gl(n,k,length) #makes an n-level, k replicates, length long vector of factors
y <- edit(x) #opens a screen editor and saves changes made to x into y
fix(x) #opens a screen editor window and makes and saves changes to x

B.4 Statistics and transformations

max()
min()
mean()
median()
interp.median() #for interpolated values
sum()
var() #produces the variance covariance matrix
sd() #standard deviation
mad() #(median absolute deviation)
fivenum() #Tukey fivenumbers min, lowerhinge, median, upper hinge, max
scale(data,scale=T) #centers around the mean and scales by the sd
colSums(), rowSums(), colMeans(), rowMeans() #see also apply(x,1,sum)
rowsum(x,group) #sum by group
cor(x,y,use="pair") #correlation matrix for pairwise complete data, use="complete" for complete cases

t.test(x,y) #x is a data vector, y is a grouping vector independent groups


t.test(x,y,pair=TRUE) #x is a data vector, y is a grouping vector – paired groups
pairwise.t.test(x,g) #does multiple comparisons of all groups defined by g
aov(x~y,data=datafile) #where x and y can be matrices
aov.ex1 = aov(Alertness~Dosage,data=data.ex1) #do the analysis of variance or
aov.ex2 = aov(Alertness~Gender*Dosage,data=data.ex2) #do a two way analysis of variance
summary(aov.ex1) #show the summary table
print(model.tables(aov.ex1,"means"),digits=3) #report the means and the number of subjects/cell
boxplot(Alertness~Dosage,data=data.ex1) #graphical summary appears in graphics window

lm(x~y,data=dataset) #basic linear model where x and y can be matrices

lm(Y~X) #Y and X can be matrices
lm(Y~X1+X2)
lm(Y~X|W) #separate analyses for each level of W
solve(A,B) #solves A %*% x = B for x (i.e., the inverse of A times B) - used for linear regression
solve(A) #inverse of A

B.5 Useful additional commands

colSums (x, na.rm = FALSE, dims = 1)


rowSums (x, na.rm = FALSE, dims = 1)
colMeans(x, na.rm = FALSE, dims = 1)
rowMeans(x, na.rm = FALSE, dims = 1)
rowsum(x, group, reorder = TRUE, ...) #finds row sums for each level of a grouping variable
apply(X, MARGIN, FUN, ...) #applies the function (FUN) to either rows (1) or columns (2) of object X
apply(x,1,min) #finds the minimum for each row
apply(x,2,max) #finds the maximum for each column
col.max(x) #another way to find which column has the maximum value for each row
which.min(x)
which.max(x)
z=apply(big5r,1,which.min) #for each row, tells which column has the minimum value

B.6 Graphics

stem() #stem and leaf diagram

par(mfrow=c(2,1)) #number of rows and columns to graph

boxplot(x,notch=T,names=grouping, main="title") #boxplot (box and whiskers)



hist() #histogram
plot()
plot(x,y,xlim=range(-1,1),ylim=range(-1,1),main=title)
par(mfrow=c(1,1)) #change the graph window back to one figure
symb=c(19,25,3,23)
colors=c("black","red","green","blue")
charact=c("S","T","N","H")
plot(x,y,pch=symb[group],col=colors[group],bg=colors[condit],cex=1.5,main="main title")
points(mPA,mNA,pch=symb[condit],cex=4.5,col=colors[condit],bg=colors[condit])

curve()
abline(a,b)
abline(a, b, untf = FALSE, ...)
abline(h=, untf = FALSE, ...)
abline(v=, untf = FALSE, ...)
abline(coef=, untf = FALSE, ...)
abline(reg=, untf = FALSE, ...)

identify()
plot(eatar,eanta,xlim=range(-1,1),ylim=range(-1,1),main=title)
identify(eatar,eanta,labels=labels(energysR[,1]) ) #dynamically puts names on the plots
locator() #returns the coordinates of points clicked on a plot
pairs() #SPLOM (scatter plot Matrix)

matplot () #ordinate is row of the matrix


biplot () #factor loadings and factor scores on same graph
coplot(x~y|z) #x by y conditioned on z
symb=c(19,25,3,23) #choose some nice plotting symbols
colors=c("black","red","green","blue") #choose some nice colors

barplot() #simple bar plot


interaction.plot () #shows means for an ANOVA design

plot(degreedays,therms) #show the data points


by(heating,Location,function(x) abline(lm(therms~degreedays,data=x))) #show the best fitting regression for each group

x= recordPlot() #save the current plot device output in the object x


replayPlot(x) #replot object x
dev.control #various control functions for printing/saving graphic files

Table B.1 Functions used in this chapter. * are part of the psych package.

?            Help about a function (same as help). Example: ?round or help(round)
%*%          Binary operator to do vector or matrix multiplication. Example: A %*% B
%+%          *Binary operator to do vector or matrix like sums. Example: A %+% B
==           Test of equality. Example: A == B
<-           Assignment: A is replaced by B. This notation is preferred to =. Example: A <- B
=            Assignment: A is replaced by B (see <-). Example: A = B
[,]          Evaluate an element of a matrix, array, or data.frame; A[i,j] is the ith row, jth column. Example: x.ij = X[i,j]
$            Evaluate an element of a list or data.frame; A$B is the element with name B in A. Example: a.b = A$B
as.vector()  Make a set of numbers into a vector. Example: A <- as.vector(A)
c()          Combine two or more items. Example: A <- c(B,C)
colnames(), rownames()  Find or make the column (row) names. Example: A <- colnames(B) finds them; colnames(A) <- B makes them
colMeans(), rowMeans()  Find the means of each column or row of a data.frame or matrix. Example: Ac <- colMeans(A); Ar <- rowMeans(A)
curve()      Plot the curve for a specified function. Example: curve(1/(1+exp(-x)),-3,3)
data.frame() Create a data.frame (similar to a matrix, but can have different types of elements). Example: A <- data.frame(x,y,z)
describe()   *Report basic descriptive statistics for a vector, matrix, or data.frame. Example: describe(X)
diag()       Create or find the diagonal of a square matrix. Example: A <- diag(B) finds it; diag(B) <- A creates it
dim()        Report the dimensions (rows, columns) of a data.frame or matrix. Example: n.rows <- dim(x.df)[1]
exp()        Raise e to the A power. Example: exp(A)
for()        Execute a loop from start to finish. Example: for (i in start:finish) {some operation using i}
function()   Create a new function to do something. Example: new.f <- function(x,y) {new <- x + y}
if() ... else  Do something if a logical condition holds, something else if it does not. Example: if(A < B) {print("B")} else {print("A")}
length()     Report the number of elements in a vector. Example: n <- length(v)
list()       A general way of storing results. Example: A <- list(a=1,b=2,c=3.4)
log()        Find the natural logarithm of X. Example: A <- log(X)
lower.tri()  Logical function, true if an element is in the lower triangular submatrix of a matrix (see upper.tri). Example: A <- lower.tri(B)
matrix()     Create a matrix of m*n elements with m rows and n columns. Example: A <- matrix(B,ncol=n)
mean(), median()  Find the mean or median of a data.frame, vector, or matrix. Example: A.m <- mean(A)
model.fit()  *Calculate 3 alternative goodness of fit indices (see text)
paste()      Combine several numeric or text variables into a string. Example: A <- paste('B','is','4')
pairs.panels()  *Plot the scatter plot matrix (SPLOM) and report the correlations for a data.frame or matrix. Example: pairs.panels(x.df)
pnorm()      The probability of an observation, z, given the normal distribution (-Inf < z < Inf). Example: p <- pnorm(z)
qnorm()      The z score associated with a particular quantile (probability) in a normal distribution (0 < x < 1). Example: z <- qnorm(x)
read.clipboard()  *Read a data matrix or data table from the clipboard. Example: A <- read.clipboard()
rep()        Repeat A N times. Example: rep(A,N)
round()      Round off numbers to n digits. Example: round(x,n)
runif()      Create n random numbers, uniformly distributed between a and b (defaults to a=0, b=1). Example: runif(n,a,b)
sample()     Draw size samples (with or without replacement) from x. Example: sample(x, size, replace=FALSE)
set.seed()   Supply a particular start value to the random number generator. Example: set.seed(42)
seq()        Form the sequence from lower to upper by step size. Example: x <- seq(lower,upper,step)
t()          Transpose a vector or matrix. Example: ta <- t(A)
upper.tri()  Logical function, true if an element is in the upper triangular submatrix of a matrix (see lower.tri). Example: A <- upper.tri(B)
Appendix E
Appendix: A Review of Matrices

Although first used by the Babylonians to solve simultaneous equations, and discussed in the
Nine Chapters on Mathematical Art in China ca. 300-200 BCE, matrices were not introduced
into psychological research until Thurstone first used the word matrix in 1933 (Bock, 2007).
Until then, data and even correlations were organized into “tables”. Vectors, matrices and
arrays are merely convenient ways to organize objects (usually numbers) and with the intro-
duction of matrix notation, the power of matrix algebra was unleashed for psychometrics.
Much of psychometrics in particular, and of psychological data analysis in general, consists of
operations on vectors and matrices. In many commercial software applications, some of the
functionality of matrices is seen in the use of “spreadsheets”. Many commercial statistical
packages do the analysis in terms of matrices but shield the user from this fact. This is
unfortunate, because it is (with some practice) easier to understand the similarity of many
algorithms when they are expressed in matrix form.
This appendix offers a quick review of matrix algebra with a particular emphasis upon
how to do matrix operations in R. The later part of the appendix shows how some fairly
complex psychometrics concepts are done easily in terms of matrices.

E.1 Vectors

A vector is a one dimensional array of n elements where the most frequently used elements
are integers, reals (numeric), characters, or logical. Basic operations on a vector are addition,
subtraction and multiplication. Although addition and subtraction are straightforward, mul-
tiplication is somewhat more complicated, for the order in which two vectors are multiplied
changes the result. That is, ab ≠ ba. (In an attempt at consistent notation, vectors will be
bold faced lower case letters.)
Consider v1 = the first 6 integers, and v2 = the next 6 integers:
> v1 <- seq(1, 6)
> v2 <- seq(7, 12)
> v1
[1] 1 2 3 4 5 6
> v2
[1] 7 8 9 10 11 12


We can add a constant to each element in a vector, add each element of the first vector
to the corresponding element of the second vector, multiply each element by a scalar, or
multiply each element in the first by the corresponding element in the second:
> v3 <- v1 + 20
> v4 <- v1 + v2
> v5 <- v1 * 3
> v6 <- v1 * v2

> v3
[1] 21 22 23 24 25 26
> v4
[1] 8 10 12 14 16 18
> v5
[1] 3 6 9 12 15 18
> v6
[1] 7 16 27 40 55 72

E.1.1 Vector multiplication

Strangely enough, a vector in R is dimensionless, but it has a length. There are three types of
multiplication of vectors in R: simple multiplication (each term in one vector is multiplied by
its corresponding term in the other vector, e.g., v6 <- v1 * v2), as well as the inner and outer
products of two vectors. The inner product is a very powerful operation for it combines both
multiplication and addition. That is, for two vectors of the same length, the inner product
of v1 and v2 is found by the matrix multiply operator %*%
$$\begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \end{pmatrix} \%*\% \begin{pmatrix} 7 \\ 8 \\ 9 \\ 10 \\ 11 \\ 12 \end{pmatrix} = \sum_{i=1}^{n} v1_i \, v2_i = \sum_{i=1}^{n} v6_i = 217 \tag{E.1}$$
In the previous example, because of the way R handles vectors, and because v1 and
v2 were of the same length, it was not necessary to worry about rows and columns and
v2 %*% v1 = v1 %*% v2. In general, however, the multiplication of two vectors will yield different
results depending upon the order. A row vector times a column vector of the same
length produces a single element which is equal to the sum of the products of the respective
elements. But a column vector of length c times a row vector of length r results in
the c x r outer product matrix of products. To see this, consider the vector v7 = seq(1,4)
and the results of v1 %*% v7 versus v7 %*% v1. Unless otherwise specified, all vectors may be
thought of as column vectors. To force v7 to be a row vector, use the transpose function t.
To transpose a vector changes a column vector into a row vector and a row vector into a
column vector. It is shown with the superscript T or sometimes with the superscript '.
Then $\underset{(6\times 1)}{v1} \%*\% \underset{(1\times 4)}{v7'} = \underset{(6\times 4)}{V8}$ and $\underset{(4\times 1)}{v7} \%*\% \underset{(1\times 6)}{v1'} = \underset{(4\times 6)}{V9}$. To clarify this notation, note that the
first subscript of each vector refers to the number of rows and the second to the number of

columns in a matrix. Matrices are written in bold face upper case letters. For a vector, of
course, either the number of columns or rows is 1. Note also that for the multiplication to
be done, the inner subscripts (e.g., 1 and 1 in this case) must correspond, but that the outer
subscripts (e.g., 6 and 4) do not.
$$\underset{(6\times 1)}{v1} \%*\% \underset{(1\times 4)}{v7'} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{pmatrix} \%*\% \begin{pmatrix} 1 & 2 & 3 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 4 & 6 & 8 \\ 3 & 6 & 9 & 12 \\ 4 & 8 & 12 & 16 \\ 5 & 10 & 15 & 20 \\ 6 & 12 & 18 & 24 \end{pmatrix} = \underset{(6\times 4)}{V8} \tag{E.2}$$

but

$$\underset{(4\times 1)}{v7} \%*\% \underset{(1\times 6)}{v1'} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix} \%*\% \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 \\ 2 & 4 & 6 & 8 & 10 & 12 \\ 3 & 6 & 9 & 12 & 15 & 18 \\ 4 & 8 & 12 & 16 & 20 & 24 \end{pmatrix} = \underset{(4\times 6)}{V9} \tag{E.3}$$
That is, in R
> v7 <- seq(1,4)
> V8 <- v1 %*% t(v7)
> V9 <- v7 %*% t(v1)

v1 %*% t(v7)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
[4,] 4 8 12 16
[5,] 5 10 15 20
[6,] 6 12 18 24

v7 %*% t(v1)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 2 4 6 8 10 12
[3,] 3 6 9 12 15 18
[4,] 4 8 12 16 20 24

and $\underset{(4\times 1)}{v7} \%*\% \underset{(1\times 6)}{v1'} = \underset{(4\times 6)}{V9} \neq \underset{(6\times 4)}{V8}$.

E.1.2 Simple statistics using vectors

Although there are built in functions in R to do most of our statistics, it is useful to understand
how these operations can be done using vector and matrix operations. Here we consider how
to find the mean of a vector, remove it from all the numbers, and then find the average
squared deviation from the mean (the variance).

Consider the mean of all numbers in a vector. To find this we just need to add up the
numbers (the inner product of the vector with a vector of 1s) and then divide by n (multiply
by the scalar 1/n). First we create a vector, v, and then a second vector, one, of 1s by using
the repeat operation.
> v <- seq(1, 7)
> one <- rep(1,length(v))
> sum.v <- t(one) %*% v
> sum.v
[,1]
[1,] 28

> mean.v <- sum.v * (1/length(v))


[,1]
[1,] 4
> mean.v <- t(one) %*% v * (1/length(v))

> v
[1] 1 2 3 4 5 6 7
> one
[1] 1 1 1 1 1 1 1
> sum.v
[,1]
[1,] 28
The mean may be calculated in three different ways, all of which are equivalent.
> mean.v <- t(one) %*% v/length(v)

> sum.v * (1/length(v))


[,1]
[1,] 4
> t(one) %*% v * (1/length(v))
[,1]
[1,] 4
> t(one) %*% v/length(v)
[,1]
[1,] 4
As vectors, this was
$$\sum_{i=1}^{n} v_i / n = \mathbf{1}^T \mathbf{v} * \frac{1}{n} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \end{pmatrix} * \frac{1}{7} = 4 \tag{E.4}$$

The variance is the average squared deviation from the mean. To find the variance, we first
find deviation scores by subtracting the mean from each value of the vector. Then, to find
the sum of the squared deviations take the inner product of the result with itself. This Sum
of Squares becomes a variance if we divide by the degrees of freedom (n-1) to get an unbiased
estimate of the population variance. First we find the mean centered vector:
> v - mean.v
[1] -3 -2 -1 0 1 2 3
And then we find the variance as the mean square by taking the inner product:
> Var.v <- t(v - mean.v) %*% (v - mean.v) * (1/(length(v) - 1))
Var.v
[,1]
[1,] 4.666667
Compare these results with the more typical scale, mean and var operations:
> scale(v, scale = FALSE)
[,1]
[1,] -3
[2,] -2
[3,] -1
[4,] 0
[5,] 1
[6,] 2
[7,] 3
attr(,"scaled:center")
[1] 4
> mean(v)
[1] 4
> var(v)
[1] 4.666667

E.1.3 Combining vectors with cbind and rbind

To combine two or more vectors with the result being a vector, use the c function.
> x <- c(v1, v2,v3)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 25 26
We can form more complex data structures than vectors by combining the vectors, either
by columns (cbind) or by rows (rbind). The resulting data structure is a matrix with the
number of rows and columns depending upon the number of vectors combined, and the
number of elements in each vector.
> Xc <- cbind(v1, v2, v3)

V1 V2 V3
[1,] 1 7 21
[2,] 2 8 22
[3,] 3 9 23
[4,] 4 10 24
[5,] 5 11 25
[6,] 6 12 26
> Xr <- rbind(v1, v2, v3)
[,1] [,2] [,3] [,4] [,5] [,6]
V1 1 2 3 4 5 6
V2 7 8 9 10 11 12
V3 21 22 23 24 25 26
> dim(Xc)
[1] 6 3
> dim(Xr)
[1] 3 6

E.2 Matrices

A matrix is just a two dimensional (rectangular) organization of numbers. It is a vector of


vectors. For data analysis, the typical data matrix is organized with rows containing the
responses of a particular subject and the columns representing different variables. Thus, a
6 x 4 data matrix (6 rows, 4 columns) would contain the data of 6 subjects on 4 different
variables. In the example below the matrix operation has taken the numbers 1 through 24
and organized them column wise. That is, a matrix is just a way (and a very convenient one
at that) of organizing a data vector in a way that highlights the correspondence of multiple
observations for the same individual.
R provides numeric row and column names (e.g., [1,] is the first row, [,4] is the fourth
column), but it is useful to label the rows and columns to make the rows (subjects) and
columns (variables) distinction more obvious. We do this using the rownames and colnames
functions, combined with the paste and seq functions.
> Xij <- matrix(seq(1:24), ncol = 4)
> rownames(Xij) <- paste("S", seq(1, dim(Xij)[1]), sep = "")
> colnames(Xij) <- paste("V", seq(1, dim(Xij)[2]), sep = "")
> Xij
V1 V2 V3 V4
S1 1 7 13 19
S2 2 8 14 20
S3 3 9 15 21
S4 4 10 16 22
S5 5 11 17 23
S6 6 12 18 24

Just as the transpose of a vector makes a column vector into a row vector, so does the
transpose of a matrix swap the rows for the columns. Applying the t function to the matrix
Xij produces its transpose, Xij'. Note that now the subjects are columns and the variables are the rows.
> t(Xij)
S1 S2 S3 S4 S5 S6
V1 1 2 3 4 5 6
V2 7 8 9 10 11 12
V3 13 14 15 16 17 18
V4 19 20 21 22 23 24

E.2.1 Adding or multiplying a vector and a Matrix

Just as we could with vectors, we can add, subtract, multiply or divide the matrix by a scalar
(a number without a dimension).
> Xij + 4

V1 V2 V3 V4
S1 5 11 17 23
S2 6 12 18 24
S3 7 13 19 25
S4 8 14 20 26
S5 9 15 21 27
S6 10 16 22 28
> round((Xij + 4)/3, 2)
V1 V2 V3 V4
S1 1.67 3.67 5.67 7.67
S2 2.00 4.00 6.00 8.00
S3 2.33 4.33 6.33 8.33
S4 2.67 4.67 6.67 8.67
S5 3.00 5.00 7.00 9.00
S6 3.33 5.33 7.33 9.33
We can also add or multiply each row (or column, depending upon order) by a vector.
This is more complicated than it would appear, for R does the operations columnwise. This
is best seen in an example:
> v <- 1:4
[1] 1 2 3 4
> Xij + v
V1 V2 V3 V4
S1 2 10 14 22
S2 4 12 16 24

S3 6 10 18 22
S4 8 12 20 24
S5 6 14 18 26
S6 8 16 20 28
> Xij * v
V1 V2 V3 V4
S1 1 21 13 57
S2 4 32 28 80
S3 9 9 45 21
S4 16 20 64 44
S5 5 33 17 69
S6 12 48 36 96
These are not the expected results if the intent was to add or multiply a different number to
each column! R operates on the columns and wraps around to the next column to complete
the operation. To add the n elements of v to the n columns of Xij, use the t function to
transpose Xij and then transpose the result back to the original order:
> t(t(Xij) + v)
V1 V2 V3 V4
S1 2 9 16 23
S2 3 10 17 24
S3 4 11 18 25
S4 5 12 19 26
S5 6 13 20 27
S6 7 14 21 28
> V10 <- t(t(Xij) * v)
> V10
V1 V2 V3 V4
S1 1 14 39 76
S2 2 16 42 80
S3 3 18 45 84
S4 4 20 48 88
S5 5 22 51 92
S6 6 24 54 96
To find a matrix of deviation scores, just subtract the means vector from each cell. The
scale function does this with the option scale=FALSE. The default for scale is to convert
a matrix to standard scores.
> scale(V10,scale=FALSE)
V1 V2 V3 V4
S1 -2.5 -5 -7.5 -10
S2 -1.5 -3 -4.5 -6
S3 -0.5 -1 -1.5 -2

S4 0.5 1 1.5 2
S5 1.5 3 4.5 6
S6 2.5 5 7.5 10
attr(,"scaled:center")
V1 V2 V3 V4
3.5 19.0 46.5 86.0

E.2.2 Matrix multiplication

Matrix multiplication is a combination of multiplication and addition and is one of the most
used and useful matrix operations. For a matrix X of dimensions r*p and Y of dimension
p*c, the product, XY, is an r*c matrix where each element is the sum of the products
of the rows of the first and the columns of the second. That is, the matrix $\underset{(r\times c)}{XY}$ has elements
$xy_{ij}$ where each
$$xy_{ij} = \sum_{k=1}^{p} x_{ik} * y_{kj}$$

The resulting $xy_{ij}$ cells of the product matrix are sums of the products of the row elements
of the first matrix times the column elements of the second. There will be as many cells as there
are rows of the first matrix and columns of the second matrix.
$$\underset{(r_x \times p)(p \times c_y)}{XY} = \begin{pmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \end{pmatrix} \begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \\ y_{41} & y_{42} \end{pmatrix} = \begin{pmatrix} \sum\limits_{i} x_{1i} y_{i1} & \sum\limits_{i} x_{1i} y_{i2} \\ \sum\limits_{i} x_{2i} y_{i1} & \sum\limits_{i} x_{2i} y_{i2} \end{pmatrix}$$

It should be obvious that matrix multiplication is a very powerful operation, for it repre-
sents in one product the r * c summations taken over p observations.
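As a brief illustration of this operation in R (the numbers here are arbitrary), a 2 x 4 matrix times a 4 x 2 matrix yields a 2 x 2 matrix of summed products:

> X <- matrix(1:8, nrow = 2, byrow = TRUE)  #a 2 x 4 matrix
> Y <- matrix(1:8, ncol = 2)                #a 4 x 2 matrix
> X %*% Y                                   #each cell is the sum of row i of X times column j of Y
     [,1] [,2]
[1,]   30   70
[2,]   70  174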

E.2.2.1 Using matrix multiplication to find means and deviation scores

Matrix multiplication can be used with vectors as well as matrices. Consider the product of a
vector of ones, 1, and the matrix $\underset{(r\times c)}{Xij}$ with 6 rows of 4 columns. Call an individual element in
this matrix $x_{ij}$. Then the sum for each column of the matrix is found by premultiplying the matrix
Xij by the transpose of the "one" vector. Dividing each of these resulting sums by the number of rows
(cases) yields the mean for each column. That is, find
$$\mathbf{1}' Xij = \sum_{i=1}^{n} X_{ij}$$

for the c columns, and then divide by the number (n) of rows. Note that the same result is
found by the colMeans(Xij) function.

$$\mathbf{1}' Xij \frac{1}{n} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 7 & 13 & 19 \\ 2 & 8 & 14 & 20 \\ 3 & 9 & 15 & 21 \\ 4 & 10 & 16 & 22 \\ 5 & 11 & 17 & 23 \\ 6 & 12 & 18 & 24 \end{pmatrix} \frac{1}{6} = \begin{pmatrix} 21 & 57 & 93 & 129 \end{pmatrix} \frac{1}{6} = \begin{pmatrix} 3.5 & 9.5 & 15.5 & 21.5 \end{pmatrix}$$

We can use the dim function to find out how many cases (the number of rows) or the
number of variables (number of columns). dim has two elements: dim(Xij)[1] = number of
rows, dim(Xij)[2] is the number of columns.
> dim(Xij)
[1] 6 4
> one <- rep(1,dim(Xij)[1]) #a vector of 1s
> t(one) %*% Xij #find the column sum
V1 V2 V3 V4
[1,] 21 57 93 129
> X.means <- t(one) %*% Xij /dim(Xij)[1] #find the column average
V1 V2 V3 V4
3.5 9.5 15.5 21.5
A built in function to find the means of the columns is colMeans. (See rowMeans for the
equivalent for rows.)
> colMeans(Xij)
V1 V2 V3 V4
3.5 9.5 15.5 21.5
To form a matrix of deviation scores, where the elements of each column are deviations
from that column mean, it is necessary to either do the operation on the transpose of the
Xij matrix, or to create a matrix of means by premultiplying the means vector by a vector
of ones and subtracting this from the data matrix.
> X.diff <- Xij - one %*% X.means
> X.diff
V1 V2 V3 V4
S1 -2.5 -2.5 -2.5 -2.5
S2 -1.5 -1.5 -1.5 -1.5
S3 -0.5 -0.5 -0.5 -0.5
S4 0.5 0.5 0.5 0.5
S5 1.5 1.5 1.5 1.5
S6 2.5 2.5 2.5 2.5
This can also be done by using the scale function which will mean center each column and
(by default) standardize by dividing by the standard deviation of each column.

E.2.2.2 Using matrix multiplication to find variances and covariances

Variances and covariances are measures of dispersion around the mean. We find these by first
subtracting the means from all the observations. This means centered matrix is the original
matrix minus a vector of means. To make a more interesting data set, randomly order (in
this case, sample without replacement) from the items in Xij and then find the X.means and
X.diff matrices.
> set.seed(42) #set random seed for a repeatable example
> Xij <- matrix(sample(Xij),ncol=4) #random sample from Xij
> rownames(Xij) <- paste("S", seq(1, dim(Xij)[1]), sep = "")
> colnames(Xij) <- paste("V", seq(1, dim(Xij)[2]), sep = "")
> Xij
V1 V2 V3 V4
S1 22 14 12 15
S2 24 3 17 6
S3 7 11 5 4
S4 18 16 9 21
S5 13 23 8 2
S6 10 19 1 20
> X.means <- t(one) %*% Xij /dim(Xij)[1] #find the column average
> X.diff <- Xij -one %*% X.means
> X.diff
V1 V2 V3 V4
S1 6.333333 -0.3333333 3.3333333 3.666667
S2 8.333333 -11.3333333 8.3333333 -5.333333
S3 -8.666667 -3.3333333 -3.6666667 -7.333333
S4 2.333333 1.6666667 0.3333333 9.666667
S5 -2.666667 8.6666667 -0.6666667 -9.333333
S6 -5.666667 4.6666667 -7.6666667 8.666667

Compare this result to just using the scale function to mean center the data:
X.cen <- scale(Xij,scale=FALSE).
To find the variance/covariance matrix, find the matrix product of the means centered
matrix X.diff with itself and divide by n-1. Compare this result to the result of the cov
function (the normal way to find covariances). The difference between these two results is
the rounding, to whole numbers for the first and to two decimals for the second.
> X.cov <- t(X.diff) %*% X.diff /(dim(X.diff)[1]-1)
> round(X.cov)

V1 V2 V3 V4
V1 46 -23 34 8
V2 -23 48 -25 12
V3 34 -25 31 -12
V4 8 12 -12 70

> round(cov(Xij),2)
V1 V2 V3 V4
V1 45.87 -22.67 33.67 8.13
V2 -22.67 47.87 -24.87 11.87
V3 33.67 -24.87 30.67 -12.47
V4 8.13 11.87 -12.47 70.27

E.2.3 Finding and using the diagonal

Some operations need to find just the diagonal of the matrix. For instance, the diagonal of the
matrix X.cov (found above) contains the variances of the items. To extract just the diagonal,
or create a matrix with a particular diagonal we use the diag command. We can convert
the covariance matrix X.cov to a correlation matrix X.cor by pre and post multiplying the
covariance matrix with a diagonal matrix containing the reciprocal of the standard deviations
(square roots of the variances). Remember (Chapter 4) that the correlation, $r_{xy}$, is the ratio
of the covariance to the square root of the product of the variances:
$$r_{xy} = \frac{C_{xy}}{\sqrt{V_x V_y}}.$$

> X.var <- diag(X.cov)


V1 V2 V3 V4
45.86667 47.86667 30.66667 70.26667
> sdi <- diag(1/sqrt(diag(X.cov)))
> rownames(sdi) <- colnames(sdi) <- colnames(X.cov)
> round(sdi, 2)
V1 V2 V3 V4
V1 0.15 0.00 0.00 0.00
V2 0.00 0.14 0.00 0.00
V3 0.00 0.00 0.18 0.00
V4 0.00 0.00 0.00 0.12
> X.cor <- sdi %*% X.cov %*% sdi #pre and post multiply by 1/sd
> rownames(X.cor) <- colnames(X.cor) <- colnames(X.cov)
> round(X.cor, 2)
V1 V2 V3 V4
V1 1.00 -0.48 0.90 0.14
V2 -0.48 1.00 -0.65 0.20
V3 0.90 -0.65 1.00 -0.27
V4 0.14 0.20 -0.27 1.00
Compare this to the standard command for finding correlations cor.
> round(cor(Xij), 2)

V1 V2 V3 V4
V1 1.00 -0.48 0.90 0.14
V2 -0.48 1.00 -0.65 0.20
V3 0.90 -0.65 1.00 -0.27
V4 0.14 0.20 -0.27 1.00

E.2.4 The Identity Matrix

The identity matrix is merely that matrix, which when multiplied by another matrix, yields
the other matrix. (The equivalent of 1 in normal arithmetic.) It is a diagonal matrix with 1
on the diagonal.

> I <- diag(1,nrow=dim(X.cov)[1])


[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1

E.2.5 Matrix Inversion

The inverse of a square matrix is the matrix equivalent of dividing by that matrix. That is,
either pre or post multiplying a matrix by its inverse yields the identity matrix. The inverse
is particularly important in multiple regression, for it allows us to solve for the beta weights.
Given the equation
$$\hat{y} = bX + c$$
we can solve for b by multiplying both sides of the equation by X' to form a square matrix
XX' and then taking the inverse of that square matrix:
$$yX' = bXX' \Leftrightarrow b = yX'(XX')^{-1}$$

We can find the inverse by using the solve function. To show that $XX^{-1} = X^{-1}X = I$, we
do the multiplication.
> X.inv <- solve(X.cov)
V1 V2 V3 V4
V1 0.9872729 -0.14437764 -1.3336591 -0.32651075
V2 -0.1443776 0.05727259 0.2239567 0.04677367
V3 -1.3336591 0.22395665 1.8598554 0.44652285
V4 -0.3265108 0.04677367 0.4465228 0.12334760
> round(X.cov %*% X.inv, 2)

V1 V2 V3 V4
V1 1 0 0 0
V2 0 1 0 0
V3 0 0 1 0
V4 0 0 0 1
> round(X.inv %*% X.cov, 2)
V1 V2 V3 V4
V1 1 0 0 0
V2 0 1 0 0
V3 0 0 1 0
V4 0 0 0 1
There are multiple ways of finding the matrix inverse, solve is just one of them. Ap-
pendix E.4.1 goes into more detail about how inverses are used in systems of simultaneous
equations. Chapter 5 considers the use of matrix operations in multiple regression.

E.2.6 Eigenvalues and Eigenvectors

The eigenvectors of a matrix are said to provide a basis space for the matrix. This is a set of
orthogonal vectors which when multiplied by the appropriate scaling vector of eigenvalues
will reproduce the matrix.
Given an n * n matrix R, each eigenvector $x_i$ solves the equation

$$x_i R = \lambda_i x_i$$

and the set of n eigenvectors are solutions to the equation

$$XR = \lambda X$$

where X is a matrix of orthogonal eigenvectors and $\lambda$ is a diagonal matrix of the eigenvalues,
$\lambda_i$. Then
$$x_i R - \lambda_i x_i I = 0 \Leftrightarrow x_i(R - \lambda_i I) = 0.$$
Finding the eigenvectors and values is computationally tedious, but may be done using the
eigen function which uses a QR decomposition of the matrix. That the vectors making up
X are orthogonal means that
$$XX' = I$$
and because they form the basis space for R that

$$R = X\lambda X'.$$

That is, it is possible to recreate the correlation matrix R in terms of an orthogonal set of
vectors (the eigenvectors) scaled by their associated eigenvalues. (See 6.1.1 and Table 6.2 for
an example of an eigenvalue decomposition using the eigen function.)
The sum of the eigenvalues of a correlation matrix is its trace (the number of variables), and the
number of nonzero eigenvalues is its rank. The product of the eigenvalues is the determinant of the matrix.
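A brief sketch of this decomposition in R, applied to the X.cor matrix found above (most output omitted):

> ev <- eigen(X.cor)                             #eigenvalues and eigenvectors of the correlation matrix
> ev$values                                      #the eigenvalues; for this 4 x 4 matrix they sum to 4
> evec <- ev$vectors                             #the orthogonal eigenvectors
> round(t(evec) %*% evec, 2)                     #X'X = I, showing the eigenvectors are orthogonal
> round(evec %*% diag(ev$values) %*% t(evec), 2) #reproduces X.cor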

E.2.7 Determinants

The determinant of an n * n correlation matrix may be thought of as the proportion of


the possible n-space spanned by the variable space and is sometimes called the generalized
variance of the matrix. As such, it can also be considered as the volume of the variable space.
If the correlation matrix is thought of as representing vectors within an n dimensional space,
then the square roots of the eigenvalues are the lengths of the axes of that space. The product
of these, the determinant, is then the volume of the space. It will be a maximum when the
axes are all of unit length and be zero if at least one axis is zero. Think of a three dimensional
sphere (and then generalize to an n dimensional hypersphere). If it is squashed in a way that
preserves the sum of the lengths of the axes, then the volume of the oblate hypersphere will be
reduced.
The determinant is an inverse measure of the redundancy of the matrix. The smaller the
determinant, the more variables in the matrix are measuring the same thing (are correlated).
The determinant of the identity matrix is 1, the determinant of a matrix with at least
two perfectly correlated (linearly dependent) rows or columns will be 0. If the matrix is
transformed into a lower triangular matrix, the determinant is the product of the diagonal elements.
The determinant of an n * n square matrix, R, is also the product of the n eigenvalues of that
matrix:
$$\det(R) = \|R\| = \prod_{i=1}^{n} \lambda_i \tag{E.5}$$
and the characteristic equation for a square matrix, X, is

$$\|X - \lambda I\| = 0$$

where $\lambda$ is an eigenvalue of X.
The determinant may be found by the det function. The determinant may be used in
estimating the goodness of fit of a particular model to the data, for when the model fits
perfectly, then the inverse of the model times the data will be an identity matrix and the
determinant will be one (See Chapter 6 for much more detail.)
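A short numerical check of these statements, again using the X.cor matrix from above:

> det(X.cor)                         #the determinant found directly
> prod(eigen(X.cor)$values)          #the product of the eigenvalues gives the same value
> det(diag(1, 4))                    #the determinant of an identity matrix is 1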

E.3 Matrix operations for data manipulation

Using the basic matrix operations of addition and multiplication allow for easy manipulation
of data. In particular, finding subsets of data, scoring multiple scales for one set of items, or
finding correlations and reliabilities of composite scales are all operations that are easy to do
with matrix operations.
In the next example we consider 5 extraversion items for 200 subjects collected as part
of the Synthetic Aperture Personality Assessment project. The items are taken from the
International Personality Item Pool (ipip.ori.org) and are downloaded from a remote server.
A larger data set taken from the SAPA project is included as the bfi data set in psych. We
use this remote set to demonstrate the ability to read data from the web. Because the first
item is an identification number, we drop the first column.
> datafilename = "http://personality-project.org/R/datasets/extraversion.items.txt"
> items = read.table(datafilename, header = TRUE)

> items <- items[, -1]


> dim(items)

[1] 200 5
We first use functions from the psych package to describe these data both numerically
and graphically.
> library(psych)

[1] "psych" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base"


> describe(items)
var n mean sd median trimmed mad min max range skew kurtosis se
q_262 1 200 3.07 1.49 3 3.01 1.48 1 6 5 0.23 -0.90 0.11
q_1480 2 200 2.88 1.38 3 2.83 1.48 0 6 6 0.21 -0.85 0.10
q_819 3 200 4.57 1.23 5 4.71 1.48 0 6 6 -1.00 0.71 0.09
q_1180 4 200 3.29 1.49 4 3.30 1.48 0 6 6 -0.09 -0.90 0.11
q_1742 5 200 4.38 1.44 5 4.54 1.48 0 6 6 -0.72 -0.25 0.10
> pairs.panels(items)

We can form two composite scales, one made up of the first 3 items, the other made up of
the last 2 items. Note that the second (q1480) and fourth (q1180) are negatively correlated
with the remaining 3 items. This implies that we should reverse these items before scoring.
Forming the composite scales, reversing the items, and finding the covariances and then the
correlations between the scales may all be done by matrix operations on either the items or
on the covariances between the items. In either case, we want to define a "keys" matrix describing
which items to combine on which scale. The correlations are, of course, merely the
covariances divided by the square roots of the variances.

E.3.1 Matrix operations on the raw data

> keys <- matrix(c(1, -1, 1, 0, 0, 0, 0, 0, -1, 1), ncol = 2) #specify


> keys # and show the keys matrix
> X <- as.matrix(items) #matrix operations require matrices
> X.ij <- X %*% keys #this results in the scale scores
> n <- dim(X.ij)[1] # how many subjects?
> one <- rep(1, dim(X.ij)[1])
> X.means <- t(one) %*% X.ij/n
> X.cov <- t(X.ij - one %*% X.means) %*% (X.ij - one %*% X.means)/(n - 1)
> round(X.cov, 2)

> keys
[,1] [,2]
[1,] 1 0
[2,] -1 0
[3,] 1 0

[4,] 0 -1
[5,] 0 1

Fig. E.1 Scatter plot matrix (SPLOM) of 5 extraversion items for 200 subjects.

[,1] [,2]
[1,] 10.45 6.09
[2,] 6.09 6.37

> X.sd <- diag(1/sqrt(diag(X.cov)))


> X.cor <- t(X.sd) %*% X.cov %*% (X.sd)
> round(X.cor, 2)
[,1] [,2]
[1,] 1.00 0.75
[2,] 0.75 1.00

E.3.2 Matrix operations on the correlation matrix

The previous example found the correlations and covariances of the scales based upon the
raw data. We can also do these operations on the correlation matrix.
> keys <- matrix(c(1, -1, 1, 0, 0, 0, 0, 0, -1, 1), ncol = 2)
> X.cor <- cor(X)
> round(X.cor, 2)
q_262 q_1480 q_819 q_1180 q_1742
q_262 1.00 -0.26 0.41 -0.51 0.48
q_1480 -0.26 1.00 -0.66 0.52 -0.47
q_819 0.41 -0.66 1.00 -0.41 0.65
q_1180 -0.51 0.52 -0.41 1.00 -0.49
q_1742 0.48 -0.47 0.65 -0.49 1.00
> X.cov <- t(keys) %*% X.cor %*% keys
> X.sd <- diag(1/sqrt(diag(X.cov)))
> X.cor <- t(X.sd) %*% X.cov %*% (X.sd)
> keys
[,1] [,2]
[1,] 1 0
[2,] -1 0
[3,] 1 0
[4,] 0 -1
[5,] 0 1
> round(X.cov, 2)
[,1] [,2]
[1,] 5.66 3.05
[2,] 3.05 2.97
> round(X.cor, 2)
[,1] [,2]
[1,] 1.00 0.74
[2,] 0.74 1.00

E.3.3 Using matrices to find test reliability

The reliability of a test may be thought of as the correlation of the test with a test just like
it. One conventional estimate of reliability, based upon the concepts from domain sampling
theory, is coefficient alpha (α). For a test with just one factor, α is an estimate of the
amount of the test variance due to that factor. However, if there are multiple factors in the
test, α neither estimates how much of the variance of the test is due to one, general factor, nor
does it estimate the correlation of the test with another test just like it. (See Zinbarg et al.
(2005) for a discussion of alternative estimates of reliability.)
E.3 Matrix operations for data manipulation 477

Given either a covariance or correlation matrix of items, α may be found by simple matrix
operations:
1) Let V = the correlation or covariance matrix.
2) Let $V_t$ = the total variance = the sum of all the elements of the correlation (or covariance) matrix for that
scale.
3) Let n = the number of items in the scale.
4) Then
$$\alpha = \frac{V_t - \mathrm{diag}(V)}{V_t} * \frac{n}{n-1}$$
To demonstrate the use of matrices to find coefficient α, consider the five items measuring
extraversion taken from the International Personality Item Pool. Two of the items need to
be weighted negatively (reverse scored).
Alpha may be found from either the correlation matrix (standardized alpha) or the co-
variance matrix (raw alpha). In the case of standardized alpha, the diag(V) is the same as
the number of items. Using a key matrix, we can find the reliability of 3 different scales: the
first is made up of the first 3 items, the second of the last 2, and the third is made up of all
the items.
> datafilename <- "http://personality-project.org/R/datasets/extraversion.items.txt"
> items = read.table(datafilename, header = TRUE)
> items <- items[, -1]
> key <- matrix(c(1, -1, 1, 0, 0, 0, 0, 0, -1, 1, 1, -1, 1, -1, 1), ncol = 3)
> colnames(key) <- c("V1-3", "V4-5", "V1-5")
> rownames(key) <- colnames(items)

> key
V1-3 V4-5 V1-5
q_262 1 0 1
q_1480 -1 0 -1
q_819 1 0 1
q_1180 0 -1 -1
q_1742 0 1 1

> raw.r <- cor(items) #find the correlations -- could have been done with matrix operations
> V <- t(key) %*% raw.r %*% key
> rownames(V) <- colnames(V) <- c("V1-3", "V4-5", "V1-5")
> round(V, 2)
V1-3 V4-5 V1-5
V1-3 5.66 3.05 8.72
V4-5 3.05 2.97 6.03
V1-5 8.72 6.03 14.75
> n <- diag(t(key) %*% key)
> alpha <- (diag(V) - n)/(diag(V)) * (n/(n - 1))
> round(alpha, 2)
V1-3 V4-5 V1-5
0.71 0.66 0.83

As would be expected, there are multiple functions in R to score scales and find coefficient
alpha this way. In psych the score.items function will work on raw data, and the cluster.cor
function will work on correlation matrices.
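A minimal sketch of that approach, using the key matrix and items data defined above. (The exact argument order and the names of the returned components are given here from memory rather than from the text, and should be checked against the psych documentation.)

> scales <- score.items(key, items)   #score the three scales from the raw item data
> scales$alpha                        #coefficient alpha for each scale
> cluster.cor(key, cor(items))        #scale intercorrelations found from the item correlation matrix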

E.4 Multiple correlation

Given a set of n predictors of a criterion variable, what is the optimal weighting of the n
predictors? This is, of course, the problem of multiple correlation or multiple regression.
Although we would normally use the linear model (lm) function to solve this problem, we
can also do it from the raw data or from a matrix of covariances or correlations by using
matrix operations and the solve function.
Consider the data set, X, created in section E.2.1. If we want to predict V4 as a function of
the first three variables, we can do so three di↵erent ways, using the raw data, using deviation
scores of the raw data, or with the correlation matrix of the data.
For simplicity, let's relabel V4 to be Y and V1 ... V3 to be X1 ... X3, and then define X as the
first three columns and Y as the last column:
X1 X2 X3
S1 9 4 9
S2 9 7 1
S3 2 9 9
S4 8 2 9
S5 6 4 0
S6 5 9 5
S7 7 9 3
S8 1 1 9
S9 6 4 4
S10 7 5 8
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
7 8 3 6 0 8 0 2 9 6

E.4.1 Data level analyses

At the data level, we can work with the raw data matrix X, or convert these to deviation
scores (X.dev) by subtracting the means from all elements of X. At the raw data level we
have
$$\underset{(m\times 1)}{\hat{Y}} = \underset{(m\times n)}{X}\,\underset{(n\times 1)}{b} + \underset{(m\times 1)}{e} \tag{E.6}$$
and we can solve for b by pre-multiplying by X' (thus making the matrix on the right
side of the equation into a square matrix so that we can multiply through by an inverse; see
section E.2.5)
$$X'\hat{Y} = X'Xb + X'e \tag{E.7}$$
and then solving for beta by pre-multiplying both sides of the equation by $(X'X)^{-1}$:

$$b = (X'X)^{-1}X'Y \tag{E.8}$$
These beta weights will be the weights with no intercept. Compare this solution to the one
using the lm function with the intercept removed:

> beta <- solve(t(X) %*% X) %*% (t(X) %*% Y)


> round(beta, 2)
[,1]
X1 0.56
X2 0.03
X3 0.25
> lm(Y ~ -1 + X)
Call:
lm(formula = Y ~ -1 + X)

Coefficients:
XX1 XX2 XX3
0.56002 0.03248 0.24723
If we want to find the intercept as well, we can add a column of 1’s to the X matrix. This
matches the normal lm result.

> one <- rep(1, dim(X)[1])


> X <- cbind(one, X)
> print(X)
one X1 X2 X3
S1 1 9 4 9
S2 1 9 7 1
S3 1 2 9 9
S4 1 8 2 9
S5 1 6 4 0
S6 1 5 9 5
S7 1 7 9 3
S8 1 1 1 9
S9 1 6 4 4
S10 1 7 5 8
> beta <- solve(t(X) %*% X) %*% (t(X) %*% Y)
> round(beta, 2)

[,1]
one -0.94
X1 0.62
X2 0.08
X3 0.30

> lm(Y ~ X)

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept) Xone XX1 XX2 XX3
-0.93843 NA 0.61978 0.08034 0.29577
We can do the same analysis with deviation scores. Let X.dev be a matrix of deviation
scores; then we can write the equation
$$\hat{Y} = X.dev\,b + e \tag{E.9}$$
and solve for
$$b = (X.dev'X.dev)^{-1}X.dev'Y. \tag{E.10}$$
(We don't need to worry about the sample size here because n cancels out of the equation.)
At the structure level, the covariance matrices X'X/(n-1) and X'Y/(n-1) may be replaced
by correlation matrices (by pre and post multiplying by a diagonal matrix of 1/sds), R and $r_{xy}$,
and we then solve the equation
$$b = R^{-1}r_{xy} \tag{E.11}$$
Consider the set of 3 variables with intercorrelations (R)
x1 x2 x3
x1 1.00 0.56 0.48
x2 0.56 1.00 0.42
x3 0.48 0.42 1.00
and correlations of x with y ( rxy )
x1 x2 x3
y 0.4 0.35 0.3

From the correlation matrix, we can use the solve function to find the optimal beta weights.
> R <- matrix(c(1, 0.56, 0.48, 0.56, 1, 0.42, 0.48, 0.42, 1), ncol = 3)
> rxy <- matrix(c(0.4, 0.35, 0.3), ncol = 1)
> colnames(R) <- rownames(R) <- c("x1", "x2", "x3")
> rownames(rxy) <- c("x1", "x2", "x3")
> colnames(rxy) <- "y"
> beta <- solve(R, rxy)
> round(beta, 2)
y
x1 0.26
x2 0.16
x3 0.11
Using the correlation matrix to do multiple R is particularly useful when the correlation or
covariance matrix is from a published source, or if, for some reason, the original data are not
available. The mat.regress function in psych finds multiple R this way. Unfortunately, by
not having the raw data, many of the error diagnostics are not available.

E.5 Multiple regression as a system of simultaneous equations

Many problems in data analysis require solving a system of simultaneous equations. For
instance, in multiple regression with two predictors and one criterion with a set of correlations
of:
$$\begin{Bmatrix} r_{x_1x_1} & r_{x_1x_2} & r_{x_1y} \\ r_{x_1x_2} & r_{x_2x_2} & r_{x_2y} \\ r_{x_1y} & r_{x_2y} & r_{yy} \end{Bmatrix} \tag{E.12}$$
we want to find the weights, $b_i$, that when multiplied by $x_1$ and $x_2$ maximize the correlation
with y. That is, we want to solve the two simultaneous equations
$$\begin{cases} r_{x_1x_1}b_1 + r_{x_1x_2}b_2 = r_{x_1y} \\ r_{x_1x_2}b_1 + r_{x_2x_2}b_2 = r_{x_2y} \end{cases} \tag{E.13}$$

We can directly solve these two equations by adding and subtracting terms to the two
such that we end up with a solution to the first in terms of $b_1$ and to the second in terms of
$b_2$:
$$\begin{cases} b_1 + r_{x_1x_2}b_2/r_{x_1x_1} = r_{x_1y}/r_{x_1x_1} \\ r_{x_1x_2}b_1/r_{x_2x_2} + b_2 = r_{x_2y}/r_{x_2x_2} \end{cases}$$
which becomes
$$\begin{cases} b_1 = (r_{x_1y} - r_{x_1x_2}b_2)/r_{x_1x_1} \\ b_2 = (r_{x_2y} - r_{x_1x_2}b_1)/r_{x_2x_2} \end{cases} \tag{E.14}$$
Substituting the second row of (E.14) into the first row, and vice versa, we find
$$\begin{cases} b_1 = \bigl(r_{x_1y} - r_{x_1x_2}(r_{x_2y} - r_{x_1x_2}b_1)/r_{x_2x_2}\bigr)/r_{x_1x_1} \\ b_2 = \bigl(r_{x_2y} - r_{x_1x_2}(r_{x_1y} - r_{x_1x_2}b_2)/r_{x_1x_1}\bigr)/r_{x_2x_2} \end{cases}$$
Collecting terms, we find:
$$\begin{cases} b_1 r_{x_1x_1}r_{x_2x_2} = r_{x_1y}r_{x_2x_2} - r_{x_1x_2}(r_{x_2y} - r_{x_1x_2}b_1) \\ b_2 r_{x_2x_2}r_{x_1x_1} = r_{x_2y}r_{x_1x_1} - r_{x_1x_2}(r_{x_1y} - r_{x_1x_2}b_2) \end{cases}$$
and rearranging once again:
$$\begin{cases} b_1 r_{x_1x_1}r_{x_2x_2} - r_{x_1x_2}^2 b_1 = r_{x_1y}r_{x_2x_2} - r_{x_1x_2}r_{x_2y} \\ b_2 r_{x_1x_1}r_{x_2x_2} - r_{x_1x_2}^2 b_2 = r_{x_2y}r_{x_1x_1} - r_{x_1x_2}r_{x_1y} \end{cases}$$
Struggling on:
$$\begin{cases} b_1 (r_{x_1x_1}r_{x_2x_2} - r_{x_1x_2}^2) = r_{x_1y}r_{x_2x_2} - r_{x_1x_2}r_{x_2y} \\ b_2 (r_{x_1x_1}r_{x_2x_2} - r_{x_1x_2}^2) = r_{x_2y}r_{x_1x_1} - r_{x_1x_2}r_{x_1y} \end{cases}$$
And finally:
$$\begin{cases} b_1 = (r_{x_1y}r_{x_2x_2} - r_{x_1x_2}r_{x_2y})/(r_{x_1x_1}r_{x_2x_2} - r_{x_1x_2}^2) \\ b_2 = (r_{x_2y}r_{x_1x_1} - r_{x_1x_2}r_{x_1y})/(r_{x_1x_1}r_{x_2x_2} - r_{x_1x_2}^2) \end{cases}$$
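These closed-form expressions are easy to check numerically against the solve function. The sketch below uses illustrative values loosely based on the first two predictors of the example in section E.4 ($r_{x_1x_1} = r_{x_2x_2} = 1$, $r_{x_1x_2} = .56$, $r_{x_1y} = .4$, $r_{x_2y} = .35$):

> r11 <- 1; r22 <- 1; r12 <- .56; r1y <- .4; r2y <- .35
> b1 <- (r1y * r22 - r12 * r2y) / (r11 * r22 - r12^2)   #closed form for b1
> b2 <- (r2y * r11 - r12 * r1y) / (r11 * r22 - r12^2)   #closed form for b2
> round(c(b1, b2), 3)
[1] 0.297 0.184
> round(solve(matrix(c(1, .56, .56, 1), 2), c(.4, .35)), 3)   #the same weights from solve
[1] 0.297 0.184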

E.6 Matrix representation of simultaneous equation

Alternatively, these two equations (E.13) may be represented as the product of a vector of
unknowns (the $b$s) and a matrix of coefficients of the predictors (the $r_{x_i}$'s) and a matrix of
coefficients for the criterion ($r_{x_iy}$):¹
$$(b_1\ b_2) \begin{pmatrix} r_{x_1x_1} & r_{x_1x_2} \\ r_{x_1x_2} & r_{x_2x_2} \end{pmatrix} = (r_{x_1y}\ r_{x_2y}) \tag{E.15}$$

If we let $b = (b_1\ b_2)$, $R = \begin{pmatrix} r_{x_1x_1} & r_{x_1x_2} \\ r_{x_1x_2} & r_{x_2x_2} \end{pmatrix}$ and $r_{xy} = (r_{x_1y}\ r_{x_2y})$, then equation (E.15) becomes

$$bR = r_{xy} \tag{E.16}$$

and we can solve (E.16) for b by multiplying both sides by the inverse of R:
$$b = bRR^{-1} = r_{xy}R^{-1}$$

E.6.1 Finding the inverse of a 2 x 2 matrix

But how do we find the inverse ($R^{-1}$)? As an example, we solve the inverse of a 2 x 2 matrix,
but the technique may be applied to a matrix of any size. First, define the identity matrix,
I, as
$$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
and then the equation
$$R = IR$$
may be represented as
$$\begin{pmatrix} r_{x_1x_1} & r_{x_1x_2} \\ r_{x_1x_2} & r_{x_2x_2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} r_{x_1x_1} & r_{x_1x_2} \\ r_{x_1x_2} & r_{x_2x_2} \end{pmatrix}$$

Dropping the x subscript (for notational simplicity) we have

$$\begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix} \tag{E.17}$$

We may multiply both sides of equation (E.17) by a simple transformation matrix (T) without
changing the equality. If we do this repeatedly until the left hand side of equation (E.17) is
the identity matrix, then the first matrix on the right hand side will be the inverse of R. We
do this in several steps to show the process.
¹ See Appendix -1 for a detailed discussion of how this is done in practice with some "real" data using
the statistical program, R. In R, the inverse of a square matrix, X, is found by the solve function: X.inv
<- solve(X).

Let
$$T_1 = \begin{pmatrix} \frac{1}{r_{11}} & 0 \\ 0 & \frac{1}{r_{22}} \end{pmatrix}$$

then we multiply both sides of equation (E.17) by T1 in order to make the diagonal elements
of the left hand equation = 1 and we have

$$T_1 R = T_1 I R \tag{E.18}$$
$$\begin{pmatrix} 1 & \frac{r_{12}}{r_{11}} \\ \frac{r_{12}}{r_{22}} & 1 \end{pmatrix} = \begin{pmatrix} \frac{1}{r_{11}} & 0 \\ 0 & \frac{1}{r_{22}} \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix} \tag{E.19}$$

Then, by letting
$$T_2 = \begin{pmatrix} 1 & 0 \\ -\frac{r_{12}}{r_{22}} & 1 \end{pmatrix}$$
and multiplying $T_2$ times both sides of equation (E.19) we can make the lower off-diagonal
element = 0. (Functionally, we are subtracting $\frac{r_{12}}{r_{22}}$ times the first row from the second row.)
$$T_2 T_1 R = \begin{pmatrix} 1 & \frac{r_{12}}{r_{11}} \\ 0 & 1 - \frac{r_{12}^2}{r_{11}r_{22}} \end{pmatrix} = \begin{pmatrix} 1 & \frac{r_{12}}{r_{11}} \\ 0 & \frac{r_{11}r_{22} - r_{12}^2}{r_{11}r_{22}} \end{pmatrix} = \begin{pmatrix} \frac{1}{r_{11}} & 0 \\ -\frac{r_{12}}{r_{11}r_{22}} & \frac{1}{r_{22}} \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix} \tag{E.20}$$

Then, in order to make the diagonal elements all = 1, we let
$$T_3 = \begin{pmatrix} 1 & 0 \\ 0 & \frac{r_{11}r_{22}}{r_{11}r_{22} - r_{12}^2} \end{pmatrix}$$

and multiplying $T_3$ times both sides of equation (E.20) we have

$$\begin{pmatrix} 1 & \frac{r_{12}}{r_{11}} \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{1}{r_{11}} & 0 \\ \frac{-r_{12}}{r_{11}r_{22} - r_{12}^2} & \frac{r_{11}}{r_{11}r_{22} - r_{12}^2} \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix} \tag{E.21}$$

Then, to make the upper off-diagonal element = 0, we let
$$T_4 = \begin{pmatrix} 1 & -\frac{r_{12}}{r_{11}} \\ 0 & 1 \end{pmatrix}$$

and multiplying $T_4$ times both sides of equation (E.21) we have

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{r_{22}}{r_{11}r_{22} - r_{12}^2} & \frac{-r_{12}}{r_{11}r_{22} - r_{12}^2} \\ \frac{-r_{12}}{r_{11}r_{22} - r_{12}^2} & \frac{r_{11}}{r_{11}r_{22} - r_{12}^2} \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} \\ r_{12} & r_{22} \end{pmatrix}$$

That is, the inverse of our original matrix, R, is

$$R^{-1} = \begin{pmatrix} \frac{r_{22}}{r_{11}r_{22} - r_{12}^2} & \frac{-r_{12}}{r_{11}r_{22} - r_{12}^2} \\ \frac{-r_{12}}{r_{11}r_{22} - r_{12}^2} & \frac{r_{11}}{r_{11}r_{22} - r_{12}^2} \end{pmatrix} \tag{E.22}$$

The previous example was drawn out to be easier to follow, and it would be possible
to combine several steps together. The important point is that by successively multiplying
equation E.17 by a series of transformation matrices, we have found the inverse of the original
matrix.

$$T_4T_3T_2T_1 R = T_4T_3T_2T_1 I R$$
or, in other words,
$$T_4T_3T_2T_1 R = I = R^{-1}R$$
$$T_4T_3T_2T_1 I = R^{-1} \tag{E.23}$$

E.7 A numerical example of finding the inverse

Consider the following covariance matrix, C, and set of transform matrices, T1 ... T4, as
derived before.
$$C = \begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix}$$
The first transformation is to change the diagonal elements to 1 by dividing the elements of
each row by that row's diagonal element. (This is two operations: the first divides the elements of
the first row by 3, the second divides the elements of the second row by 4.)
$$T_1 C = \begin{pmatrix} .33 & .00 \\ .00 & .25 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 1.0 & .667 \\ .5 & 1 \end{pmatrix}$$

The next operation is to make the lower off-diagonal element 0 by subtracting .5 times
the first row from the second row.
$$T_2 T_1 C = \begin{pmatrix} 1.0 & 0 \\ -.5 & 1 \end{pmatrix} \begin{pmatrix} .33 & .00 \\ .00 & .25 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 1.0 & .667 \\ 0 & .667 \end{pmatrix}$$

Then make the diagonals 1 again by multiplying the elements of the second row by 1.5 (this
could be combined with the next operation).
$$T_3 T_2 T_1 C = \begin{pmatrix} 1.0 & 0 \\ 0 & 1.5 \end{pmatrix} \begin{pmatrix} 1.0 & 0 \\ -.5 & 1 \end{pmatrix} \begin{pmatrix} .33 & .00 \\ .00 & .25 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 1.0 & .67 \\ 0 & 1.0 \end{pmatrix}$$

Now multiply the second row by -.67 and add it to the first row. The set of products has created
the identity matrix.
$$T_4 T_3 T_2 T_1 C = \begin{pmatrix} 1 & -.67 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1.0 & 0 \\ 0 & 1.5 \end{pmatrix} \begin{pmatrix} 1.0 & 0 \\ -.5 & 1 \end{pmatrix} \begin{pmatrix} .33 & .00 \\ .00 & .25 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

As shown in equation E.23, if we apply this same set of transformations to the identity matrix,
I, we find the inverse of C:
$$T_4 T_3 T_2 T_1 I = \begin{pmatrix} 1 & -.67 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1.0 & 0 \\ 0 & 1.5 \end{pmatrix} \begin{pmatrix} 1.0 & 0 \\ -.5 & 1 \end{pmatrix} \begin{pmatrix} .33 & .00 \\ .00 & .25 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} .5 & -.250 \\ -.25 & .375 \end{pmatrix}$$

That is,
$$C^{-1} = \begin{pmatrix} .5 & -.250 \\ -.25 & .375 \end{pmatrix}$$
We confirm this by multiplying
$$CC^{-1} = \begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} .5 & -.250 \\ -.25 & .375 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
Of course, a much simpler technique is to simply enter the original matrix into R and use
the solve function:
>C <- matrix(c(3,2,2,4),byrow=TRUE,nrow=2)
> C
[,1] [,2]
[1,] 3 2
[2,] 2 4
> solve(C)
[,1] [,2]
[1,] 0.50 -0.250
[2,] -0.25 0.375
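As a further check, the product of the four transformation matrices (a sketch using the exact fraction 2/3 rather than the rounded .67) reproduces the same inverse:

> T1 <- diag(c(1/3, 1/4))              #divide row 1 by 3 and row 2 by 4
> T2 <- matrix(c(1, -.5, 0, 1), 2)     #subtract .5 times row 1 from row 2
> T3 <- diag(c(1, 1.5))                #rescale row 2 to put a 1 on the diagonal
> T4 <- matrix(c(1, 0, -2/3, 1), 2)    #subtract 2/3 times row 2 from row 1
> round(T4 %*% T3 %*% T2 %*% T1, 3)    #equals solve(C): .5, -.25, -.25, .375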

E.8 Examples of inverse matrices

E.8.1 Inverse of an identity matrix

The inverse of the identity matrix is just the identity matrix:


I
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
> solve (I)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1

E.8.2 The effect of correlation size on the inverse

As the correlations in the matrix become larger, the elements of the inverse become
disproportionately larger. This is shown below for matrices of size 2 and 3 with correlations
ranging from 0 to .99.

The effect of multicollinearity is not particularly surprising when we examine equation
(E.22) and notice that in the two by two case, the elements are divided by $r_{11}r_{22} - r_{12}^2$. As
$r_{12}^2$ approaches $r_{11}r_{22}$, this ratio will tend towards infinity.
Because the inverse is used in estimation of the linear regression weights, as the correlations
between the predictors increase, the elements of the inverse grow very large, and small
variations in the pattern of predictors will lead to large variations in the beta weights.
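A small sketch of that instability, with purely illustrative numbers: for two predictors correlated .99, nudging one validity coefficient from .30 to .32 changes the beta weights dramatically.

> R.high <- matrix(c(1, .99, .99, 1), 2)    #two nearly collinear predictors
> round(solve(R.high, c(.30, .30)), 2)      #validities of .30 and .30
[1] 0.15 0.15
> round(solve(R.high, c(.32, .30)), 2)      #nudge the first validity to .32
[1]  1.16 -0.84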

Original matrix Inverse of Matrix


> a > round(solve(a),2)
[,1] [,2] [,1] [,2]
[1,] 1.0 0.5 [1,] 1.33 -0.67
[2,] 0.5 1.0 [2,] -0.67 1.33
> b > round(solve(b),2)
[,1] [,2] [,1] [,2]
[1,] 1.0 0.8 [1,] 2.78 -2.22
[2,] 0.8 1.0 [2,] -2.22 2.78
> c > round(solve(c),2)
[,1] [,2] [,1] [,2]
[1,] 1.0 0.9 [1,] 5.26 -4.74
[2,] 0.9 1.0 [2,] -4.74 5.26

> A > round(solve(A),2)


[,1] [,2] [,3] [,1] [,2] [,3]
[1,] 1 0 0 [1,] 1 0 0
[2,] 0 1 0 [2,] 0 1 0
[3,] 0 0 1 [3,] 0 0 1
> B > round(solve(B),2)
[,1] [,2] [,3] [,1] [,2] [,3]
[1,] 1.0 0.0 0.5 [1,] 1.38 0.23 -0.76
[2,] 0.0 1.0 0.3 [2,] 0.23 1.14 -0.45
[3,] 0.5 0.3 1.0 [3,] -0.76 -0.45 1.52
>C > round(solve(C),2)
[,1] [,2] [,3] [,1] [,2] [,3]
[1,] 1.0 0.8 0.5 [1,] 3.5 -2.50 -1.00
[2,] 0.8 1.0 0.3 [2,] -2.5 2.88 0.38
[3,] 0.5 0.3 1.0 [3,] -1.0 0.38 1.38
> D > round(solve(D),2)
[,1] [,2] [,3] [,1] [,2] [,3]
[1,] 1.0 0.9 0.5 [1,] 7.58 -6.25 -1.92
[2,] 0.9 1.0 0.3 [2,] -6.25 6.25 1.25
[3,] 0.5 0.3 1.0 [3,] -1.92 1.25 1.58
> E > round(solve(E),2)
[,1] [,2] [,3] [,1] [,2] [,3]
[1,] 1.00 0.95 0.5 [1,] 21.41 -18.82 -5.06
[2,] 0.95 1.00 0.3 [2,] -18.82 17.65 4.12
[3,] 0.50 0.30 1.0 [3,] -5.06 4.12 2.29
> F > round(solve(F),2)
[,1] [,2] [,3] [,1] [,2] [,3]
[1,] 1.00 0.99 0.5 [1,] -39.39 36.36 8.79
[2,] 0.99 1.00 0.3 [2,] 36.36 -32.47 -8.44
[3,] 0.50 0.30 1.0 [3,] 8.79 -8.44 -0.86
