Professional Documents
Culture Documents
Research Journal 6 PDF
Research Journal 6 PDF
research-article2019
CDPXXX10.1177/0963721419849559McCrae, MõttusWhat Personality Scales Measure
ASSOCIATION FOR
PSYCHOLOGICAL SCIENCE
Current Directions in Psychological
Abstract
Classical psychometrics held that scores on a personality measure were determined by the trait assessed and random
measurement error. A new view proposes a much richer and more complex model that includes trait variance at mul-
tiple levels of a hierarchy of traits and systematic biases shaped by the implicit personality theory of the respondent.
The model has implications for the optimal length and content of scales and for the use of scales intended to correct
for evaluative bias; further, it suggests that personality assessments should supplement self-reports with informant rat-
ings. The model also has implications for the very nature of personality traits.
Keywords
true score, method bias, implicit personality theory, five-factor model, reliability, informant ratings
A typical personality scale consists of several items— in fact the constant component may include systematic
statements or single adjectives. The respondent rates errors or method biases that do not accurately reflect
the extent to which each item describes a target—either the trait and thus do not contribute to the true score in
oneself (in self-reports) or someone whom the respon- our sense of the term.
dent knows well, such as a spouse, sibling, or close Although there are many different kinds of method
friend (in informant reports). The sum of these item bias, by far the most attention has been given to evalu-
ratings is the target’s score on the scale, and ideally it ative biases (variously conceived as social desirability,
is directly proportional to the amount of the trait the impression management, defensiveness, faking good,
target actually possesses. We refer to the veridical por- or simply lying): the enduring tendency of the respon-
tion of the observed score as the true score. dent to exaggerate good qualities and minimize bad
Neither test developers nor respondents are perfect, qualities. People vary in this tendency; some actually
so the scale score only approximates the true score. have negative scores and even exaggerate their weak-
Items can be ambiguous or misread, and numerical nesses. If assessors could measure evaluative bias, they
ratings may be somewhat arbitrary; all of these intro- could subtract it out of the observed score and better
duce random error into the scale score. This is why approximate the true score. Enormous efforts have been
people seldom receive exactly the same score when made to do just that, but with curiously limited success
the test is readministered even a short time later. Retest (Piedmont, McCrae, Riemann, & Angleitner, 2000; Roth
reliability is a measure of how close retest scores are & Altmann, 2019).
to the original scores. Psychologists have primarily relied on self-reports to
In classical psychometrics (e.g., Lord & Novick, assess personality. With only one source of information,
1968), the term “true score” referred to the hypothetical a model of personality scores consisting of true score,
average of an indefinitely large number of administra-
tions of a scale to a respondent; the observed score on Corresponding Author:
any one occasion was thus seen as a constant (i.e., true) Robert R. McCrae, 90 Magnolia Ave., Gloucester, MA 01930
score plus or minus some quantity of random error. But E-mail: rrmccrae@gmail.com
2 McCrae, Mõttus
random error, and evaluative bias may seem plausible. 2010; see Table 1). More reliable facets showed higher
But a body of research has also examined personality validity by all three criteria—but only for retest reli-
ratings from well-acquainted informants and shown ability. Further, retest reliabilities almost invariably
that, like self-reports, these are stable, heritable, and exceeded internal-consistency reliabilities, 2 suggesting
related to a variety of life outcomes. One would be that scales contain something reliable beyond their
tempted to say that self-reports and informant ratings items’ shared variance.
are interchangeable, except that they show far less than In an attempt to understand these findings, McCrae
perfect agreement. For example, Costa and McCrae (2015) analyzed different components of variance in
(1988) reported a 6-year longitudinal study of three facet scales and items, distinguishing between true
broad traits—Neuroticism, Extraversion, and Openness— score, method bias, and random error. It became clear
using both self-reports and spouse ratings. Both assess- that there were different kinds of true-score variance:
ments showed substantial stability, with retest Some was shared by all of the items in a scale, and
correlations of about .80. But agreement between self- some was specific to each item. For example, the NEO
reports and spouse ratings was much lower, only about angry-hostility facet has one item that concerns feelings
.55. Clearly, there must be something in self-reports that of bitterness and resentment and another about being
is stable over time but not shared by spouse ratings, and hot blooded and quick tempered. Both are expressions
vice versa. We argue that this fact allows us to build a of the same trait, but they are also subtly different in
better model of personality scores. content; each represents a different nuance of angry
One might guess that the traits being rated are actu- hostility. Figure 1 illustrates the components of variance
ally somewhat different: Perhaps the self-report, which in a typical facet scale and shows how they contribute
is based in part on private thoughts, feelings, and to various properties of the trait. This model explains
desires, reflects the “inner” personality, whereas the why retest reliability is a better predictor of validity
informant rating reflects the “outer.” If that were true, than internal consistency: It contains all the true-score
we would expect that two different informants would variance (T + S1–k), whereas internal consistency con-
agree more strongly with each other than they agree tains only the common portion (T).
with the target’s self-report because both informants It has been known for decades that personality traits
see the outer personality. But that is not the case; infor- form a hierarchy: The five broad domains at the top of
mants agree with each other only about as well as they the NEO-PI-R hierarchy are composed of narrower and
do with self-reports (e.g., Costa & McCrae, 1992). more specific facets (see Table 1). McCrae (2015)
It appears that each respondent has a view of the hypothesized that there is at least one more lower level,
target shaped in part by the trait but also in part by the nuances that express facets. But are they actually
systematic, enduring, and idiosyncratic biases—source- traits? If so, like higher-level traits, they should be stable
method biases—that consistently over- or underestimate over time and heritable, and different observers should
trait levels. By examining properties of traits, we can agree on them. Mõttus, Kandler, Bleidorn, Riemann,
tease apart the magnitude of these biases. For example, and McCrae (2017) tested these hypotheses in a German
in the Costa and McCrae (1988) study, about 80% of the sample of twins on whom self-report and informant-
variance was stable over time, but only 55% was shared rating data were available from two assessments 5 years
by different sources, suggesting that 25% (80% − 55%) apart. Not surprisingly, most of the items were stable,
is stable source-method bias. The discovery of the pres- heritable, and consensually valid because they are
ence, structure, and magnitude of source-method biases expressions of higher-order traits that are known to
has ushered in a new view of what scales measure, with have those properties. But Mõttus and colleagues also
profound implications for how personality should be analyzed items’ nuance-specific variance after statisti-
assessed.1 cally removing all of the higher-order variance, and a
majority of these item residuals still showed the char-
acteristic properties of traits. These findings were later
Nuances: A New Source of True Score replicated in samples from other countries (Mõttus
Most psychologists know the adage that the reliabil- et al., 2018). Nuances are real—if quite narrow—traits,
ity of a scale sets an upper limit to its validity—but and people appear to have hundreds of them.
there are different forms of reliability. McCrae, Kurtz,
Yamagata, and Terracciano (2011) compared internal-
Hierarchy in Source-Method Biases
consistency reliability (the extent to which all items in
a scale assess the same thing) with retest reliability as McCrae (2015) focused on components of variance within
predictors of the longitudinal stability, heritability, and items and facets. A later examination of relations between
cross-observer validity of the 30 facets of the Revised facets and the broader domains led to further insights
NEO Personality Inventory (NEO-PI-R; McCrae & Costa, about source-method biases (McCrae, 2018a). Although a
What Personality Scales Measure 3
global evaluative bias might affect all traits, there are true-score component (Passini & Norman, 1966); 3 such
also biases specific to domains and to facets. If for some a structure must be attributed to the mind of the rater
reason—say, a faulty first impression—a rater overestimates rather than characteristics of the target. McCrae and
a target’s Conscientiousness, the rater is likely to overesti- colleagues (2018) showed that people also have an
mate the target’s standing on all of its facets: competence, implicit personality theory for nuance-level traits.
order, dutifulness, and so on (see Table 1). The same rater McCrae’s (2018a) model suggested that about 17%
might systematically underestimate the target’s true level of of the variance in the average facet scale was attribut-
Agreeableness and all of its facets. Similar biases exist at able to method bias at the domain level (influencing
the facet level, affecting all items in a facet scale. all facets in a domain), and another 22% was method
McCrae, Mõttus, Hřebíčková, Realo, and Allik (2018) bias at the facet level (influencing all items in a facet).
tested these ideas in Estonian and Czech samples. They McCrae and colleagues (2018) calculated the average
subtracted informant ratings from self-ratings, essen- domain-level bias to be 18% in the Estonian sample and
tially removing the true-score component and leaving 17% in the Czech sample, almost exactly in line with
only biases and random error. When facet residuals predictions. These consistent findings give grounds for
(informant ratings – self-ratings) were factored, they some confidence in McCrae’s (2018a) model and lead
showed a structure that mimicked the structure of the to the sobering conclusion that facet-level variance con-
traits seen in Table 1. When item residuals were fac- tains almost as much source-method bias (39%) 4 as it
tored, they reproduced the structure of facets within does the combined true score of facets and nuances
each domain. These results supported the hypothesis (42%; the rest is random error).
that biases have a hierarchical structure similar to that Although most of the analyses discussed here have
of true scores (McCrae, 2018a). used the NEO inventories, there is every reason to
The analysis of facet residuals might have shown just expect similar findings with other hierarchical instru-
a single factor of evaluative bias, or two or three method ments, such as Lee and Ashton’s (2004) six-factor
factors, or no meaningful structure at all. Why should HEXACO model or Krueger, Derringer, Markon, Watson,
biases have the same structure as real traits? Probably and Skodol’s (2012) Personality Inventory for DSM-5
because people at some level understand how traits go (Ashton, de Vries, & Lee, 2017; McCrae, 2018a).
together; they know, for example, that if people are
sociable, they are also likely to be cheerful and active—
Theoretical Implications
they are extraverts. That people have such an implicit
personality theory has been known for some time. The It now appears that the variance in personality scales
five-factor structure of facet-level traits emerges from is a mixture of true-score variance specific to domains,
ratings of strangers, in which there can be very little facets, and nuances; source-method biases that mimic
4 McCrae, Mõttus
Self-Report Self-Report
Observer Time 1 Time 2
εO εTime 1 εTime 2
MO MS MS
α α
T T rtt T
rca
S1–k S1–k S1–k
Fig. 1. Illustration of the variance attributable to random error (ε), method (M), com-
mon trait (T), and item specifics (S1–k) in facet-scale scores from an observer (O) and
from a self-report test (S) and short-term retest. Dashed lines indicate distinct contri-
butions from different items. Internal consistency (α) is attributable to method and
trait variance shared by items. Cross-observer agreement (r ca) and retest reliability (rtt)
reflect shared components of variance across observers and occasions, respectively.
The bottom half of each column represents the true score; the top half represents ran-
dom and systematic error. This is a simplified model; for a more detailed model, see
McCrae (2018a). Adapted with permission from “A more nuanced view of reliability:
Specificity in the trait hierarchy” by R. R. McCrae, 2015, Personality and Social Psychol-
ogy Review, 19, p. 100. doi:10.1177/1088868314541857. Copyright 2015 by the Society
for Personality and Social Psychology, Inc.
the hierarchical structure of the true score; and random common (their intersection)? Union traits include both
error. Moreover, we can now quantify these compo- the variance shared by all components and the variance
nents. This new psychometrics has both theoretical and unique to each, whereas intersection traits exclude the
practical implications. unique variance; this can make a substantial difference.
A century ago, most psychologists thought the human Personality-scale scores are usually the sum of items and
mind was a blank slate, with personality traits shaped thus implicitly follow the union model, but statistical-
by enculturation, childhood experiences, and learning. modeling approaches such as factor analysis and item-
We have known for the past 50 years that traits are in response theory almost always assume an intersection
part heritable; now we know that this applies not only model. New statistical methods are needed to analyze
to broad temperament factors but also to specific traits, union traits (Bentler, 2017).
even at the level of unique nuances. One person may Nuances are natural units for, and lend credibility to,
be prone to express anger through loud outbursts, network and functionalist approaches that explore asso-
whereas another is prone to seethe with resentment, ciations among numerous personality components to
and this nuanced difference is almost as heritable as any understand why they coalesce into broader traits such
other trait. Personality theories need to account for this. as Extraversion in the first place (Mõttus & Allerhand,
Personality traits are important because they predict 2018; Wood, Gardner, & Harms, 2015).
a host of outcomes, from occupational interests to hap-
piness. Today most researchers use broad domains as Implications for Research and
predictors, although some have advocated the use of
more informative facets (Paunonen & Ashton, 2001). A
Assessment
program of research is needed to determine whether Retest reliability appears to be more informative than
nuances provide yet more predictive power; initial evi- internal consistency and must be used to correct estimates
dence is that they do (Seeboth & Mõttus, 2018). of long-term stability (Costa, McCrae, & Löckenhoff, 2019).
The hierarchical model of trait structure has implica- But researchers usually report only internal consistency
tions for how traits are conceptualized. Is Extraversion because they can calculate it from a single administration
the sum (or, in the language of set theory, the union) and need not reassemble their sample a few days later to
of its component facets and nuances, or is it only the complete the scale again. Wood, Qiu, Lu, Lin, and Tov
essential core all of its facets and nuances have in (2018) have proposed that retest reliability can also be
What Personality Scales Measure 5
5. One-week retest data for the items of the NEO-PI-R are avail- Personality Psychology Compass, 1, 423–440. doi:10.1111/
able in “Item retest correlations” at the following link: https:// j.1751-9004.2007.00021.x
osf.io/cjwtb/ (Henry, & Mõttus, 2019). Mõttus, R., & Allerhand, M. H. (2018). Why do traits come
together? The underlying trait and network approaches.
In V. Zeigler-Hill & T. K. Shackelford (Eds.), The SAGE
References handbook of personality and individual differences: Vol.
Ashton, M. C., de Vries, R. E., & Lee, K. (2017). Trait vari- 1. The science of personality and individual differences
ance and response style variance in the scales of the (pp. 130–151). Thousand Oaks, CA: SAGE.
Personality Inventory for DSM–5 (PID-5). Journal of Mõttus, R., Kandler, C., Bleidorn, M., Riemann, R., & McCrae,
Personality Assessment, 99, 192–203. R. R. (2017). Personality traits below facets: The consen-
Bentler, P. M. (2017). Specificity-enhanced reliability coef- sual validity, longitudinal stability, heritability, and utility
ficients. Psychological Methods, 22, 527–540. of personality nuances. Journal of Personality and Social
Borkenau, P., & Liebler, A. (1992). Trait inferences: Sources Psychology, 112, 474–490.
of validity at zero acquaintance. Journal of Personality Mõttus, R., Sinick, J., Terracciano, A., Hřebíč ková, M., Kandler,
and Social Psychology, 62, 645–657. C., Ando, J., . . . Jang, K. L. (2018). Personality charac-
Costa, P. T., Jr., & McCrae, R. R. (1988). Personality in adult- teristics below facets: A replication and meta-analysis
hood: A six-year longitudinal study of self-reports and of cross-rater agreement, rank-order stability, heritability,
spouse ratings on the NEO Personality Inventory. Journal and utility of personality nuances. Journal of Personality
of Personality and Social Psychology, 54, 853–863. and Social Psychology. Advance online publication.
Costa, P. T., Jr., & McCrae, R. R. (1992). Trait psychology doi:10.1037/pspp0000202
comes of age. In T. B. Sonderegger (Ed.), Nebraska Passini, F. T., & Norman, W. T. (1966). A universal concep-
Symposium on Motivation: Vol. 39. Psychology and aging tion of personality structure? Journal of Personality and
(pp. 169–204). Lincoln: University of Nebraska Press. Social Psychology, 4, 44–49.
Costa, P. T., Jr., McCrae, R. R., & Löckenhoff, C. E. (2019). Paunonen, S. V., & Ashton, M. C. (2001). Big Five factors
Personality across the life span. Annual Review of and facets and the prediction of behavior. Journal of
Psychology, 70, 423–448. Personality and Social Psychology, 81, 524–539.
Henry, S., & Mõttus, R. (2019). Item retest correlations [Data Piedmont, R. L. (1998). The Revised NEO Personality Inventory:
file]. Retrieved from https://osf.io/cjwtb/ Clinical and research applications. New York, NY:
Krueger, R. F., Derringer, J., Markon, K. E., Watson, D., & Plenum Press.
Skodol, A. E. (2012). Initial construction of a maladap- Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A.
tive personality trait model and inventory for DSM-5. (2000). On the invalidity of validity scales: Evidence from
Psychological Medicine, 42, 1879–1890. self-reports and observer ratings in volunteer samples.
Lee, K., & Ashton, M. C. (2004). Psychometric properties Journal of Personality and Social Psychology, 78, 582–593.
of the HEXACO Personality Inventory. Multivariate Roth, M., & Altmann, T. (2019). A multi-informant study of the
Behavioral Research, 39, 329–358. influence of targets’ and perceivers’ social desirability on
Lord, F. M., & Novick, M. R. (1968). Statistical theories of self-other agreement in ratings of the HEXACO personality
mental test scores. Reading, MA: Addison-Wesley. dimensions. Journal of Research in Personality, 78, 138–147.
McCrae, R. R. (2015). A more nuanced view of reliability: Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An
Specificity in the trait hierarchy. Personality and Social empirical examination of the effects of different sources
Psychology Review, 19, 97–112. of measurement error on reliability estimates for mea-
McCrae, R. R. (2018a). Method biases in single-source personal- sures of individual differences constructs. Psychological
ity assessments. Psychological Assessment, 30, 1160–1173. Methods, 8, 206–224.
McCrae, R. R. (2018b). Method biases in single-source per- Seeboth, A., & Mõttus, R. (2018). Successful explanations start
sonality assessments [Supplemental material]. Retrieved with accurate descriptions: Questionnaire items as per-
from http://supp.apa.org/psycarticles/supplemental/ sonality markers for more accurate predictions. European
pas0000566/pas0000566_supp.html Journal of Personality, 32, 186–201.
McCrae, R. R., & Costa, P. T., Jr. (2010). NEO inventories pro- Singer, J. A. (2005). Personality and psychotherapy: Treating
fessional manual. Odessa, FL: Psychological Assessment the whole person. New York, NY: Guilford Press.
Resources. Soto, C. J., & John, O. P. (2017). The next Big Five Inventory (BFI-
McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. 2): Developing and assessing a hierarchical model with 15
(2011). Internal consistency, retest reliability, and their facets to enhance bandwidth, fidelity, and predictive power.
implications for personality scale validity. Personality and Journal of Personality and Social Psychology, 113, 117–143.
Social Psychology Review, 15, 28–50. Wood, D., Gardner, M. H., & Harms, P. D. (2015). How
McCrae, R. R., Mõttus, R., Hřebíčková, M., Realo, A., & Allik, J. functionalist and process approaches can explain trait
(2018). Source method biases as implicit personality the- covariation. Psychological Review, 122, 84–111.
ory at the domain and facet levels. Journal of Personality. Wood, D., Qiu, L., Lu, J., Lin, H., & Tov, W. (2018). Adjusting
Advance online publication. doi:10.1111/jopy.12435 bilingual ratings by retest reliability improves estima-
McCrae, R. R., & Sutin, A. R. (2007). New frontiers for the tion of translation quality. Journal of Cross-Cultural
five-factor model: A preview of the literature. Social and Psychology, 49, 1325–1339.