We Need To Compare, But How? Measurement Equivalence in Comparative Public Administration
Sebastian Jilke is a postdoctoral researcher at Erasmus University Rotterdam, the Netherlands. His research interests include citizen attitudes and behaviors with respect to public services, administrative reforms, and research methodology. He has published articles on these topics in European Journal of Political Research, Public Administration, and Public Management Review. He is also coeditor of a forthcoming symposium in PAR on the use of experiments in public administration research.
E-mail: jilke@fsw.eur.nl

Abstract: In addition to public administrations and public managers, there is increasing interest in studying citizens' interactions with and views toward government from a comparative perspective in order to put theories to the test using cross-national surveys. However, this will only succeed if we adequately deal with the diverse ways in which respondents in different countries and regions perceive and respond to survey measures. This article examines the concept of cross-national measurement equivalence in public administration research and explores methods for establishing equivalence. Two methodologies are examined that test and correct for measurement nonequivalence: multiple-group confirmatory factor analysis and multilevel mixture item response theory. These techniques are used to test and establish the cross-national measurement equivalence of two popular measurement constructs: citizen satisfaction with public services and trust in public institutions. Results show that appropriately dealing with nonequivalence accounts for different forms of biases that otherwise would be undetected. The article contributes to the methodological advancement in studying public administration beyond domestic borders.
Bart Meuleman is assistant professor in the Centre for Sociological Research, University of Leuven, Belgium, where he teaches research methodology. His main research interests involve cross-cultural comparisons of attitude and value patterns, such as welfare attitudes, ethnocentrism, religiosity, and basic human values. His research has appeared in Annual Review of Sociology, Public Opinion Quarterly, Journal of Cross-Cultural Psychology, and Journal of European Social Policy.
E-mail: bart.meuleman@soc.kuleuven.be

Steven Van de Walle is professor of comparative public administration and management at Erasmus University Rotterdam, the Netherlands. His research focuses on interactions between citizens and public services, trust, and public sector reform. He was coordinator of the large-scale COCOPS project (2010–14), a European research project on public sector reform in which 11 European universities collaborated.
E-mail: vandewalle@fsw.eur.nl

Public Administration Review, Vol. 75, Iss. 1 (January/February 2015), pp. 36–48. © 2014 by The American Society for Public Administration. DOI: 10.1111/puar.12318.

Consider the following survey item: "Overall, how satisfied are you with your electricity supplier? Please give me a score from 0 to 10, where 0 means that you are not satisfied at all, and 10 means that you are fully satisfied." This is one out of a battery of items that taps citizen satisfaction with public services across a wide range of countries. The underlying assumption of asking respondents in different national populations the same questions is that their answers are supposed to be comparable. In other words, it is assumed that perceptions of what satisfaction means and the way in which people use assigned scales are equivalent across countries, allowing for meaningful comparisons. But is the general notion of what a satisfactory public service is really equivalent across countries, regions, (groups of) individuals, or even over time? And are patterns of response styles the same across different cultures? In this article, we introduce two major techniques for detecting and correcting for nonequivalence in the field of public administration, and we show how these methods can be implemented in applied research.

Comparisons across countries of public administrations, public managers, and interactions and attitudes of citizens toward government are gaining ground in public administration research (e.g., Jilke 2014; Kim et al. 2013; Pollitt and Bouckaert 2011; Van Ryzin 2011). This is accompanied by an increase in the availability of cross-national surveys that contain questions that are relevant for public administration research, such as the International Social Survey Programme, the Eurobarometer, the COCOPS (Coordinating for Cohesion in the Public Sector of the Future) survey of public managers, or the COBRA (Comparative Public Organization Data Base for Research and Analysis) survey of government agency executives, among many others. Making use of such cross-national survey data gives us the opportunity to test the geographic range of social theories by assessing them in many different contexts. Moreover, having survey data from numerous countries enables us to investigate various micro-macro relations by utilizing data from the individual and the country level. Such cross-level interactions permit us to look more closely at interesting relationships between context and individuals, allowing us to explicitly test contextual theories.

However, when respondents in different countries regard measurement constructs in different manners or exhibit culturally influenced response patterns, we typically obtain biased survey measures (Poortinga 1989; Van de Vijver and Leung 1997). Practically speaking, the response of a person in country A—say, to the item on satisfaction we used as an example—may have the same scale position as the response of another person in country B, but it could mean something entirely different if the ways in which respondents interpret or respond to it differ substantially. By simply looking at mean levels of survey responses,
[Figure 1 (panels A and B). Latent score plotted against measured score for Group A and Group B.]

directly relates to dysfunctioning at the item level. An item is said to be biased "if respondents with the same standing on the underlying construct (e.g., they are equally intelligent), but who come from different cultures, do not have the same mean score on the item" (Van de Vijver 2003, 148). Common sources of item bias are poor translations and/or ambiguous items, cultural differences in the connotative meaning of item content, or the influence of culturally specific nuisance factors such as the involvement of socially desirable answering behavior toward specific items.

These types of biases are linked to different forms of measurement nonequivalence. In order to relate bias to measurement nonequivalence, we draw on the generalized latent variable framework (Skrondal and Rabe-Hesketh 2004). Here, it is commonly assumed that theoretical concepts (latent traits), such as trust or satisfaction, are not directly observable but are inferred from multiple observed manifestations of the latent trait (Bollen 2002; Davidov et al. 2014).
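The definition of item bias quoted above can be made concrete with a small simulation: two groups share exactly the same distribution on the latent trait, yet one item carries a group-specific intercept shift, so its observed means differ. The sketch below is illustrative only (NumPy, made-up parameter values), not an analysis from this article:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Same latent trait distribution in both groups (equal "true" satisfaction).
xi_a = rng.normal(0.0, 1.0, n)
xi_b = rng.normal(0.0, 1.0, n)

# Item 1 is unbiased; item 2 carries an intercept shift of +0.5 in group B
# (e.g., a translation that makes agreement easier there).
item1_a = 0.8 * xi_a + rng.normal(0.0, 0.5, n)
item1_b = 0.8 * xi_b + rng.normal(0.0, 0.5, n)
item2_a = 0.8 * xi_a + rng.normal(0.0, 0.5, n)
item2_b = 0.5 + 0.8 * xi_b + rng.normal(0.0, 0.5, n)

# Although the groups do not differ on the latent trait, the biased item
# shows a spurious mean difference of about 0.5 scale points.
print(round(item1_b.mean() - item1_a.mean(), 2))
print(round(item2_b.mean() - item2_a.mean(), 2))
```

Comparing observed means on item 2 would wrongly suggest that group B is "more satisfied," which is exactly the kind of artifact equivalence testing is designed to catch.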
The next form of equivalence, scalar equivalence, suggests that the latent variable has the same scale origin across countries, in addition to being measured using the same metric. Scalar equivalence is required when one needs to compare means across different units (see Meredith 1993). This type of equivalence refers to the equality of intercepts across groups and is affected by method and item bias. If scalar equivalence holds, this shows that respondents across groups not only share the same scale metrics but also the same scale origin. This means that they have the same score on the latent and the observed variables. This can be illustrated by looking at figure 1C, which now depicts an identical line for both groups; note that the steepness of the slopes can no longer vary. This means that we can now compare regression coefficients, covariances, and latent means across groups, allowing us to conduct substantive cross-national analyses.

How to Detect and Deal with Measurement Nonequivalence?
Operationalizing the concept of measurement equivalence, we introduce two techniques to the field of public administration to detect and deal with measurement nonequivalence in comparative research: (1) multiple-group confirmatory factor analysis and (2) multilevel mixture item response theory modeling. In the past, both approaches have enjoyed wide popularity when it comes to testing for measurement equivalence. According to Kankaraš, Vermunt, and Moors (2011), while differences between the techniques lie mainly in the terminology, model assumptions, and procedures for testing for measurement equivalence, they also share a great deal of conceptual similarities, as both can be easily summarized within a generalized latent variable framework (Skrondal and Rabe-Hesketh 2004). But while MGCFA is most appropriate for continuous data,4 IRT is specifically designed to deal with data that are of ordered categorical nature.

MGCFA primarily aims at testing the equivalence of individual items and subsequently establishes different levels of measurement equivalence, including nonequivalence and partial equivalence, in an iterative process.5 The multilevel mixture IRT model with item bias effects that is applied in the later part of this study tests and corrects for measurement nonequivalence within a single model. Both models can be easily extended to include covariates (see, e.g., Davidov et al. 2008; Stegmueller 2011).

Multiple-Group Confirmatory Factor Analysis
The standard single-group confirmatory factor analysis is designed to test a measurement model, where observed responses to a set of items are denoted as χi (where i = 1, . . . , I) and written as linear functions of the latent construct ξ that they measure (e.g., satisfaction). The model also typically includes an intercept (τi) and an error term (δi) for each item, which can be written as follows:

χ = τ + Λξ + δ. (1)

The described factor analytical model has been extended by Jöreskog (1971) to a multiple-group setting. In this MGCFA, the same factor structure is specified for each group k (where k = 1, . . . , K) simultaneously, yielding an overall model fit. Thus, we get

χk = τk + Λkξk + δk, (2)

where Λk stands for a matrix of factor loadings, meaning that it contains one value for each combination of items and the latent construct for every country. The remaining letters are vectors containing the same values as in equation (1), but with a single parameter for each group unit. Within such a framework, we can assess measurement equivalence by comparing parameter estimates across different countries. In our empirical examples, the groups are inhabitants of different countries, but one may also think of comparing different subnational, socioeducational, or professional groups or even looking at the same groups of respondents over time. Regarding the sample size required to perform a confirmatory factor analysis, Kline (2013, 179–80) recommends a ratio of at least 20 respondents per model parameter (see also Jackson 2003), with the overall sample size preferred to exceed N = 200. In the context of an MGCFA, that would mean that researchers would need at least 20 respondents per parameter, per group. But in cases in which no maximum likelihood estimators are employed or items are non-normally distributed, much larger samples are needed.

Assessing different forms of measurement equivalence.6 As we mentioned earlier, three major—hierarchically ordered—forms of measurement equivalence are commonly differentiated: configural, metric, and scalar equivalence (Steenkamp and Baumgartner 1998). Following an iterative process in testing for the different forms of equivalence, Meuleman and Billiet (2012) propose a bottom-up strategy (see also Steenkamp and Baumgartner 1998). This means starting with the lowest level of equivalence—that is, the configural model—and then stepwise testing the next hierarchical levels, first metric and then scalar equivalence.

Practically speaking, configural equivalence means that a measurement model exhibits the same patterns of salient and nonsalient factor loadings across groups (see Horn and McArdle 1992).7 It can be assessed by running an exploratory factor analysis for each country separately and subsequently comparing the number of factors on which items loaded and their parameter estimates. Furthermore, one may estimate an MGCFA without constraints across groups and check whether fit indices are within an acceptable range. If configural equivalence has been established, on this basis, full metric equivalence is tested by constraining the factor loadings in the measurement model to be equal across groups. Formally, this would mean that

Λ1 = Λ2 = . . . = ΛK. (3)
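The logic of equations (1) and (2) can be sketched numerically: data for two groups are generated from the same factor model, but one loading is allowed to differ in group B, violating metric equivalence for that item. Because the latent variance is fixed at 1 here, the covariance between each item and the latent trait recovers its loading, so comparing these estimates across groups flags the nonequivalent item. All parameter values are illustrative assumptions, and the covariance shortcut merely stands in for a full MGCFA estimation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

def simulate(loadings, intercepts, n):
    """Generate items from equation (1)/(2): x = tau + Lambda * xi + delta."""
    xi = rng.normal(0.0, 1.0, n)
    items = [t + l * xi + rng.normal(0.0, 0.5, n)
             for l, t in zip(loadings, intercepts)]
    return xi, items

# Groups A and B share intercepts, but the loading of item 3 differs in
# group B, a violation of metric equivalence for that item.
xi_a, items_a = simulate([0.9, 0.8, 0.7], [5.0, 5.0, 5.0], n)
xi_b, items_b = simulate([0.9, 0.8, 0.4], [5.0, 5.0, 5.0], n)

# With var(xi) = 1, cov(item, xi) recovers each loading; comparing the
# estimates across groups reveals the item with a deviating slope.
load_a = [np.cov(x, xi_a)[0, 1] for x in items_a]
load_b = [np.cov(x, xi_b)[0, 1] for x in items_b]
print(np.round(load_a, 2))
print(np.round(load_b, 2))
```

In applied work the latent trait is of course unobserved and the loadings are estimated jointly (e.g., in Mplus), but the diagnostic idea is the same: group-specific slopes signal metric nonequivalence.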
Thus, metric equivalence can be assessed by comparing multiple measurement models with constrained and unconstrained factor loadings across groups. Moreover, by determining which items' slopes are not equivalent across countries, scholars are able to identify nonequivalent survey items.

The lower levels of equivalence, configural and metric, serve as a prerequisite for establishing the next-stronger level of equivalence: scalar equivalence. It is tested by additionally constraining all intercepts to be equal across countries (see Meredith 1993) and can be written as follows:

τ1 = τ2 = . . . = τK. (4)

However, the described forms of equivalence may not always hold to the full extent. When this is the case, Byrne, Shavelson, and Muthén (1989) propose the concept of partial equivalence. Basically, partial equivalence requires that at least two parameters per country are equivalent, while the others are free to vary. In other words, as long as we have two items with invariant slopes across countries, we can establish partial metric equivalence. Moreover, if we find two items with equivalent slopes and intercepts, we can establish partial scalar equivalence. The basic idea behind this approach is that we need one item, the referent, to identify the scale of the latent variable and one item to determine the metric of the scale used. In practice, this would mean that we can release noninvariant parameters for some items, so long as we have two calibrating items left that are equivalent across units (see also Steenkamp and Baumgartner 1998).

Determining a significant and substantial change in model fit. When testing for different levels of measurement equivalence, the evaluation of model fit is of particular interest to researchers who want to determine whether releasing (or constraining) one additional parameter substantially changes model fit. The evaluation of model fit is typically based on the chi-square test (Kline 2011). However, in larger samples (more than 300 respondents), chi-square is known to be overly sensitive, meaning that it reaches statistical significance for very trivial model changes (Kline 2011, 201). Thus, various authors have recommended using alternative goodness-of-fit measures, such as the root mean square error of approximation (RMSEA) or the comparative fit index (CFI), among many others (Chen 2007; Williams, Vandenberg, and Edwards 2009). However, while those alternative fit measures do not possess the same problems of sensitivity to large sample sizes as chi-square, they have another problem: they do not have known sampling distributions. This makes it extremely difficult to determine an acceptable cutoff value for a statistically significant change in model fit when evaluating equivalence hypotheses (see Meuleman 2012). Moreover, simulation studies have produced very different results when it comes to establishing such cutoff values. For example, Chen (2007) determined cutoff points for global fit indices. However, in a more recent simulation study, Hox and colleagues conclude that the "reliance on global fit indices is misleading when measurement equivalence is tested" (2012, 95; see also Saris, Satorra, and Van der Veld 2009 for similar conclusions).

In line with other authors, Hox and colleagues (2012) recommend using more specific indicators of lack of fit, such as expected parameter changes, in combination with their respective modification indices (Meuleman 2012; Saris, Satorra, and Sörbom 1987; Saris, Satorra, and Van der Veld 2009; Steenkamp and Baumgartner 1998; Whittaker 2012; see also Oberski 2014). By doing so, researchers not only would avoid overfitting and a rather data-driven approach but also would be put in a position to determine a statistically significant and substantial change in model fit. In line with this reasoning, Meuleman and Billiet (2012) recommend the following procedure to determine a significant and substantial improvement (or deterioration) of fit when assessing measurement equivalence. First, one needs to determine the slope (or intercept) with the highest modification index (MI) score—which reports the change in χ2 when freeing the respective parameter. If this MI is strongly significant8 and the associated standardized (STDYX) expected parameter change is of substantive magnitude, the parameter will be relaxed.

Item Response Theory Multilevel Mixture Model with Item Bias Effects
While the use of MGCFA to detect measurement nonequivalence is often perceived as the predominant approach in cross-national research, modern IRT modeling offers similar advantages, with the particular difference that IRT techniques are specifically developed to deal with items that are discrete or ordered categorical. For ordered categorical items, such as Likert scales, the relevant specification is the so-called graded response model (Samejima 1969). It models items' C – 1 thresholds (where c is the item category with c = 1, . . . , C), which are mapped on an unobserved continuous latent response variable and, more importantly, represent transitions from one category to another (commonly referred to as item difficulty). For example, consider an item that probes for citizen trust in government with three answer categories. The two thresholds between categories determine the difficulty of moving from one category to another. If we have similar respondents in two countries with the same position on the latent trait of trust but different thresholds between item categories, then cross-national bias in response behavior is present.

Within this framework, we define an item response model for each item: individual responses j (where j = 1, . . . , J) for choosing category c are predicted using the cumulative probability νijkc for each item i (where i = 1, . . . , I) of a given respondent living in country k (where k = 1, . . . , K). Thus, it is a function of C – 1 thresholds τic (item difficulty) and the latent variable ξjk (that is, the underlying latent trait we are actually measuring, for example, "trust in public institutions"), with the strength of the relationship between item and latent variable (the so-called discrimination parameter or item loading) expressed in the model's coefficients λi (see Stegmueller 2011). In other words, individuals' probability of choosing a higher item category is expressed as a result of their stronger "trust" minus item difficulty. Formally, it can be expressed as follows:

νijkc = logit−1(λiξjk − τic). (5)

The graded response model can be "estimated with 250 respondents, but around 500 are recommended for accurate parameter estimates [when using a five point Likert scale]" (Reeve and Fayers 2005, 70). However, scholars also need to be aware of the respondent-to-parameter ratio; latent traits with many items require more respondents than short scales.

This conventional graded response model has been extended by Stegmueller (2011) to a multilevel mixture IRT model with item
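The mechanics of equation (5) can be sketched directly in code: under a logistic link, the cumulative probability of answering in category c or above is a function of the latent trait (weighted by the loading) minus the threshold, and single-category probabilities follow as differences of adjacent cumulative probabilities. The parameter values below are illustrative assumptions, not estimates from this article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def category_probs(xi, lam, thresholds):
    """Graded response model, equation (5): P(y >= c) = sigmoid(lam*xi - tau_c).
    Category probabilities are differences of adjacent cumulative probabilities."""
    cum = np.concatenate(([1.0], sigmoid(lam * xi - np.asarray(thresholds)), [0.0]))
    return cum[:-1] - cum[1:]

# A three-category trust item has two thresholds (item difficulties).
thresholds = [-1.0, 1.5]
p_low_trust = category_probs(xi=-1.0, lam=1.2, thresholds=thresholds)
p_high_trust = category_probs(xi=1.5, lam=1.2, thresholds=thresholds)

print(np.round(p_low_trust, 3))   # probability mass in the lowest category
print(np.round(p_high_trust, 3))  # mass shifts toward the highest category
```

Shifting a threshold for one country while holding ξ fixed changes these probabilities, which is precisely the item bias the multilevel mixture extension is designed to capture.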
bias effects. Item bias (denoted as δik) is expressed when item thresholds that are associated with the same score on the latent variable vary across countries. This would mean that crossing a certain category for similar respondents is more difficult in country A than in country B. If this is the case, items are not equivalent across countries. Here, instead of testing and subsequently establishing (partial) equivalence (as one would do within an MGCFA framework), this approach corrects for measurement nonequivalence by explicitly modeling it. This is done by introducing discrete random effects for individual items to vary across mixtures m (where m = 1, . . . , M)—these are groups, or, more precisely, latent classes, of countries that share unobserved heterogeneity in country item bias (denoted as ηkm).9 In such a model, item bias is allowed to vary across country mixtures that share unobserved heterogeneity in systematic response behavior. In other words, by introducing direct effects of these mixtures on items, we are able to explicitly model cross-national measurement nonequivalence.

Extending the graded response model, one has to make some changes in notation by first adding subscripts to equation (5), denoting the level of each parameter, with (1) items being nested in (2) individuals (where the latent concept of "trust" is located), nested in (3) countries (where the unobserved heterogeneity in country item bias is located). This yields a three-level model in which we then also subtract the unobserved country item bias that varies across mixtures (see Stegmueller 2011). Thus, we get an unbiased cumulative response probability by specifying

νijkc = logit−1(λiξjk − τic − δik). (6)

When estimating this model, first the number of mixtures needs to be determined. This means that we need to figure out how many latent groups there are across countries that share common characteristics in systematic country item bias. Hence, the model from equation (6) should be estimated with an increasing number of mixtures. In a next step, scholars are able to compare fit measures (e.g., AICC, BIC, log-likelihood) of the different models to determine how many mixtures best fit their data.

In such a framework, one can test for systematic country item bias by checking whether the estimates of item bias effects, δi(1), of single mixtures are significantly different from zero.10 If this is the case, we have strong evidence for the measurement nonequivalence of our items. In other words, this would mean that there exists systematic country item bias in response probability stemming from nonrandom threshold shifts across countries (see Stegmueller 2011). Ignoring those differences would yield potentially biased estimates. Furthermore, this model specification allows us to add covariates to the model in equation (6) and subsequently to estimate the "true effects" of our independent variables of interest. Thus, the IRT approach has the distinct advantage that it puts cross-national researchers in the position to explicitly correct for measurement nonequivalence and to estimate cross-national relationships within a single model.

Measurement Nonequivalence in Practice
Having introduced both empirical techniques, we will next apply them to real-life data. Our empirical examples come from cross-national public opinion surveys used within comparative public administration research. The first example is on citizen satisfaction with public services, and the second example uses data on trust in public institutions.

MGCFA and Citizen Satisfaction with Public Services
Recent years have seen an increasing interest in studying citizens' views and perceptions of public organizations. At the front line of this development is the examination of citizen satisfaction with public services, including its interrelation with individual expectations (James 2009; Morgeson 2013; Van Ryzin 2006, 2013), linkage with objective assessments of performance (Charbonneau and Van Ryzin 2012; Favero and Meier 2013; Shingler et al. 2008), and propensity to facilitate citizen trust in government (Kampen, Van de Walle, and Bouckaert 2006; Vigoda-Gadot 2007). Methodological considerations in measuring citizen satisfaction with public services have also gathered pace (Herian and Tomkins 2012; Van de Walle and Van Ryzin 2011). Thus, it can be seen that the study of citizen satisfaction with public services is of key interest to public administration scholars. A desirable next step would be the cross-national examination of theories of satisfaction in order to see whether they apply to different national contexts. Furthermore, linking individual data on citizen satisfaction with national or regional macro-level characteristics (such as the mode of delivery) would yield interesting findings regarding micro-macro relationships. In pursuing such a research agenda, however, we first need to test whether citizen satisfaction indeed exhibits cross-national measurement equivalence.

Data. We use data from the European Consumer Satisfaction Survey (ECSS). Implemented on behalf of the European Commission, the ECSS was fielded in 2006. It covers all EU-25 member countries11 and a total of 11 public services; thus, it is one of the most comprehensive surveys on citizen satisfaction in Europe. Employing country stratifications according to region, degree of urbanization, gender, age, and education, the ECSS makes use of a representative random sample for each service sector, with a minimum of 500 respondents per sector and per country. For our example, we use data from the electricity sector.

Here, service users were asked to indicate their level of satisfaction within this particular public service sector. More precisely, they were asked four questions tapping into their general level of satisfaction with electricity services:

1. Overall satisfaction (Sat Q1): "Overall, to what extent are you satisfied with [supplier name]? Please give me a score from 1 to 10 where 1 means that you are not satisfied at all, and 10 means that you are fully satisfied."
2. Confirmation of expectations (Exp Q2): "If you compare what you expect from an electricity supplier and what you get from [supplier name], to what extent would you say that your requirements are met? Please give me a score from 1 to 10 where 1 means that your expectations are not met at all, and 10 means that your expectations are not only met but even exceeded."
3. Satisfaction with service quality (Qual Q3): "I will read out a number of statements and would like you to give me, for each of them, a score where 1 means that you totally disagree and 10 means that you totally agree: [supplier name] offers high quality services, overall."
4. Price satisfaction (Price Q4): “I will read out a number indices. More important, the change in chi-square and standardized
of statements and would like you to give me, for each of expected parameter change is displayed (STDYX EPC).
them, a score where 1 means that you totally disagree and
10 means that you totally agree: Overall, [supplier name]’s We start by assessing the configural equivalence of our measurement
prices are fair, given the services provided.” model, which means testing whether it has the same factorial struc-
ture within each country. We were able to establish the equivalence
Assessing cross-national measurement equivalence. For our study of our factor structure for all of the 25 countries under study. This
of citizen satisfaction with electricity services, we first need to spec- means that within each country, all four items loaded significantly
ify the model’s factor structure. All four items should tap the latent on a single factor. Moreover, fit indices of the multiple-group meas-
construct of citizen satisfaction with electricity services. The first urement model indicated that it fits the data well (see table 1, model
two items are quite similar, which is evident from their strong cor- 0). Next we assessed the model’s metric and scalar equivalence. The
relation (r = 0.803, p < .000). Thus, we allow for covariance between full metric model fits the data well, but it still can be improved sub-
them. This can also be theoretically justified, as both items directly stantially by releasing three constrained slopes (factor loadings). We
probe for citizens’ general satisfaction. Moreover, model assess- were not able to establish full metric equivalence, as we found three
ments of individual countries without the covariance between them countries with invariant factor loadings. However, by freeing the
indicated that the model(s) would be significantly and substantially factor loadings for items Q3 and Q4, we can establish partial metric
improved by allowing a correlation between the items. This brings equivalence for all 25 countries. We can now meaningfully compare
us to the measurement model, depicted in figure 2. The figure also parameter estimates across all countries.
shows the factor loadings from the configural equivalent MGCFA
model (highest and lowest country value). The model exhibits good The next level of equivalence, full scalar, is much more difficult
measurement properties: all loadings are significantly different from to satisfy. As depicted in table 1, the full scalar model fits the data
zero and load sufficiently strongly on the latent trait of satisfaction. badly (model 6). However, it can be improved substantially by
releasing 18 intercepts. After this, there were no further possibili-
We test the measurement equivalence of citizen satisfaction with ties left for improving model fit. As we can see from table 1, our
electricity services by using MGCFA. The measurement models final model displays an acceptable fit (model 24), with no fit index
were estimated using Mplus 6. We used a maximum likelihood beyond what is generally considered to be an acceptable cutoff
robust estimator, which accounts for the non-normality of our items (Muthén and Muthén 2010, 533). Furthermore, we employed an estimation procedure that makes use of full information maximum likelihood (FIML). FIML accounts for item nonresponse by taking all available data points into the estimation procedure, regardless of whether there are missing cases (see also Little and Rubin 2002). In our case, item nonresponse was slightly above 5 percent.

For our analyses, we first determined the reference item to identify the scale of the latent variable. This choice was not made arbitrarily but was based on a procedure that sets the latent variable's variance to 1 for all countries and uses unstandardized modification index estimates to select the "most invariant item" (Sass 2011), that is, the item with the lowest overall modification index estimates; in our case, item Q1.12 When it comes to the subsequent order of the tests used to assess our models' measurement equivalence, we employed a bottom-up strategy. This is exemplified in table 1, where the iterative process for equivalence testing is displayed. It shows the respective model's fit to the data using the Satorra-Bentler scaled chi-square, the model's degrees of freedom, and the RMSEA and CFI fit indices.

value. However, we are still not able to compare means across countries; in order to make meaningful comparisons, we would need at least two items with the same invariant slopes and intercepts across countries (partial scalar equivalence). By freeing slopes and intercepts for items Q2 and Q4, we can now meaningfully compare coefficients and latent country means for 19 countries. Yet this excludes Ireland, Latvia, Lithuania, the Netherlands, Spain, and Sweden, as they all have nonequivalent intercepts for items Q1 and Q3, which suggests that it is particularly in those countries that items Q1 and Q3 function differently.

MGCFA: Does it matter? In order to exemplify the biases that comparative researchers may run into when conducting cross-national analyses, we compare the results of our partial scalar equivalence model with the status quo in comparative research: simply computing a factor score for the measured concept from the pooled country data. We estimated simple country fixed effects linear regression models using (1) factor scores and (2) the scores from our partial scalar equivalent MGCFA model. Figure 3 displays the results (using Austria, the country with the highest satisfaction scores, as a reference).
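The bias that nonequivalent intercepts introduce into mean comparisons can be illustrated with a small simulation. Everything below is hypothetical: the intercepts, loadings, and the 0.5 intercept shift are illustrative numbers, not estimates from this study. Two countries have identical latent satisfaction, yet a shifted intercept on a single item makes a naive composite-score comparison find a difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Equal latent satisfaction in both countries
theta_a = rng.normal(0.0, 1.0, n)
theta_b = rng.normal(0.0, 1.0, n)

loadings = np.array([0.8, 0.7, 0.9, 0.6])     # hypothetical slopes
intercepts = np.array([3.0, 3.2, 2.9, 3.1])   # hypothetical intercepts

def items(theta, intercepts, loadings):
    """Generate four continuous survey items from a latent trait."""
    noise = rng.normal(0.0, 0.5, (theta.size, 4))
    return intercepts + np.outer(theta, loadings) + noise

y_a = items(theta_a, intercepts, loadings)
# Country B: item 1's intercept is shifted (scalar nonequivalence)
y_b = items(theta_b, intercepts + np.array([0.5, 0.0, 0.0, 0.0]), loadings)

naive_gap = y_b.mean() - y_a.mean()                    # contaminated by the shift
invariant_gap = y_b[:, 1:].mean() - y_a[:, 1:].mean()  # invariant items only

print(round(naive_gap, 2), round(invariant_gap, 2))
```

The naive gap is roughly the intercept shift spread over four items (about 0.125), while restricting the comparison to the three invariant items recovers the true zero difference — the same logic that motivates requiring partial scalar equivalence before comparing latent means.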
Model (item released)    SB χ²     df    RMSEA    CFI      MI       Estimate
M2  (Q released)         219.87     96   0.064    0.983    13.27    –0.236
M11 (Q released)         711.41    159   0.093    0.939    44.70    –0.234
M14 (Q released)         598.72    156   0.085    0.951    33.12     0.191
M16 (Q released)         536.48    154   0.080    0.957    28.02     0.237
M17 (Q released)         509.24    153   0.078    0.959    27.25     0.248
M18 (Q released)         481.46    152   0.075    0.962    27.78    –0.211
M19 (Q released)         461.75    151   0.074    0.964    19.71    –0.219
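To relate chi-square statistics to RMSEA values of the kind reported in equivalence tables, a minimal sketch of the standard single-group population approximation follows. The sample size used here is hypothetical, and multigroup software applies rescaled variants of this formula, so these values are illustrative only.

```python
import math

def rmsea(chi_square: float, df: int, n: int) -> float:
    """Population RMSEA approximation from a chi-square statistic.

    Standard single-group formula sqrt(max(chi2 - df, 0) / (df * (n - 1)));
    n is a hypothetical total sample size, and multigroup variants rescale
    this quantity by the number of groups.
    """
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

# A chi-square close to its degrees of freedom yields RMSEA near zero;
# larger misfit per degree of freedom yields a larger RMSEA.
print(round(rmsea(219.87, 96, 5000), 3))
print(round(rmsea(711.41, 159, 5000), 3))
```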
in the United States indicate that, indeed, the conception of trust changes over time, as does people's individual response behavior (Poznyak et al. 2013). In the following, we assess the cross-national measurement properties of citizen trust in public institutions using the IRT approach.

Data. For this part of our study, we use data from the 2005 World Values Survey (WVS). The WVS is a high-quality and well-known cross-national survey, established in 1981. It regularly surveys a representative sample of national populations across a very broad

Model              Components    Log Likelihood    Parameters    BIC       AICC
M2                 4             –31,817           31            63,945    63,914
M3                 5             –31,744           36            63,849    63,813
M4                 6             –31,695           41            63,801    63,760
M5                 7             –31,652           46            63,765    63,719
M6                 8             –31,623           51            63,758    63,707
M7                 9             –31,591           56            63,744    63,688
M8 (final model)   10            –31,553           61            63,717    63,657
M9                 11            –31,545           66            63,752    63,686
44 Public Administration Review • January | February 2015
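Model selection by information criteria of this kind can be sketched as follows. The formulas are the textbook definitions of BIC and Bozdogan's consistent AIC (which may differ from the exact AICC variant used in the article), and the sample size — and therefore the computed values — is hypothetical rather than taken from the study.

```python
import math

def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: -2*LL + k*ln(n)."""
    return -2.0 * loglik + k * math.log(n)

def caic(loglik: float, k: int, n: int) -> float:
    """Consistent AIC (Bozdogan): -2*LL + k*(ln(n) + 1)."""
    return -2.0 * loglik + k * (math.log(n) + 1.0)

# Candidate models: (name, log likelihood, free parameters).
# Log likelihoods follow the table; n = 22,000 is a hypothetical sample size.
candidates = [("M2", -31817.0, 31), ("M8", -31553.0, 61), ("M9", -31545.0, 66)]
n = 22_000

best = min(candidates, key=lambda m: bic(m[1], m[2], n))
print(best[0])
```

The penalty term grows with the number of free parameters, so the most complex model (M9) improves the log likelihood but not enough to offset the penalty, and the criterion prefers M8.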
the Bayesian information criterion (BIC) and the consistent Akaike information criterion (AICC).15

In the next step, the properties of our measurement model were examined. Table 3 presents an overview of the actual factor loadings and their accompanying item thresholds. First, we can see that our items exhibit good measurement properties: all items load significantly and strongly on one latent trait, that is, trust in public institutions. Moreover, we can see that the thresholds are clearly spread out across a wide range of our latent variable. Thus, it can be concluded that our items load statistically and substantively significantly on the latent trait and that the thresholds cover a wide range of the latent variable, providing a precise measurement over a great share of the scale of trust in public institutions.

Table 3 Citizen Trust in Public Institutions Measurement Model (Model 8)

Item             Factor Loading λi (1)    SE       Threshold τi1    Threshold τi2    Threshold τi3
Police           1.989*                   0.054    2.777*           –1.434*          –4.601*
Justice system   2.637*                   0.084    3.938*           –0.685*          –4.516*
Government       1.768*                   0.049    4.543*           0.705*           –2.620*
Civil service    1.503*                   0.043    4.370*           0.227*           –3.075*
*p < .05.

Now we turn to analyzing the extent of systematic country item bias on individuals' response behavior. Table 4 reports the coefficients and standard errors of item bias (δik) for each survey item and mixture component. In order to reach model identification, item bias of the first item (trust in the police) was set to zero (see also Stegmueller 2011). From the table, we can clearly see that severe country item bias exists, and it is in the same direction for most countries (except Bulgaria, Finland, and Norway). For all countries, item bias of at least one item is significantly different from zero. This highlights the crucial role that systematic country differences in response probability play in our measure of trust in public institutions. Item bias is strongest in Switzerland, West Germany, and Canada. Looking at effect directions, we have to bear in mind that the WVS survey items measure trust in a reverse manner: a low value indicates a high level of trust, while a high value indicates a low level of trust. Thus, we can see that respondents in Switzerland, for example, systematically overreport their trust in public institutions, while people living in the western part of Germany underreport their levels of trust. If researchers simply compare responses from these countries without correcting for country item bias, they will systematically over- or underestimate people's trust in public institutions.

IRT: Does it matter? To exemplify the systematic biases that comparative scholars may encounter when analyzing cross-national data, we compared the results from our IRT model with the standard approach in the discipline, which is simply computing factor scores from pooled country data. Figure 4 reports the coefficients and accompanying 95 percent confidence intervals from linear regression models with country fixed effects. Norway, the country with the highest levels of trust, is used as the reference category. From the figure, we can clearly see that simply ignoring country item bias in response probability can lead to misleading results. For example, when we look at the factor score coefficients for Switzerland, we may conclude that Switzerland is not significantly different from Norway. But when looking at the coefficients from the IRT approach used in this study, we see that individuals living in Switzerland trust their public institutions significantly less than people living in Norway. The difference between both coefficients is the result of systematic country item bias in individuals' item response probability.

Figure 4 Country Fixed Effects and 95% Confidence Intervals, Trust in Public Institutions (factor score vs. IRT estimates; countries shown: RO, PL, SI, BG, DE-E, IT, ES, US, DE-W, CA, SE, CH, NO)
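The mechanism by which country item bias (δik) distorts observed responses can be sketched with a graded-response-type item: the bias shifts the item's thresholds for one country, changing the probability of each response category even when the latent trust level is identical. The loading, thresholds, and bias value below are hypothetical, not the model's estimates.

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def cat_probs(theta: float, loading: float, thresholds: list, bias: float = 0.0):
    """Category probabilities for a graded-response-type item.

    Cumulative form: P(y >= c) = logistic(loading*theta - (threshold_c + bias)).
    A country item bias shifts all of this item's thresholds for that country.
    All parameter values used here are hypothetical.
    """
    cum = [1.0] + [logistic(loading * theta - (t + bias)) for t in thresholds] + [0.0]
    return [cum[c] - cum[c + 1] for c in range(len(thresholds) + 1)]

# Same latent trust (theta = 0), same item, with and without country item bias
p_ref = cat_probs(0.0, 2.0, [-1.5, 0.0, 1.5])
p_biased = cat_probs(0.0, 2.0, [-1.5, 0.0, 1.5], bias=0.8)
print([round(p, 3) for p in p_ref])
print([round(p, 3) for p in p_biased])
```

With a positive bias, the same respondent becomes less likely to endorse the highest category, which is exactly why pooling raw responses across countries with nonzero δik over- or underestimates trust.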