We Need To Compare, But How? Measurement Equivalence in Comparative Public Administration
Sebastian Jilke is a postdoctoral researcher at Erasmus University Rotterdam, the Netherlands. His research interests include citizen attitudes and behaviors with respect to public services, administrative reforms, and research methodology. He has published articles on these topics in European Journal of Political Research, Public Administration, and Public Management Review. He is also coeditor of a forthcoming symposium in PAR on the use of experiments in public administration research.
E-mail: jilke@fsw.eur.nl

Abstract: In addition to public administrations and public managers, there is increasing interest in studying citizens' interactions with and views toward government from a comparative perspective in order to put theories to the test using cross-national surveys. However, this will only succeed if we adequately deal with the diverse ways in which respondents in different countries and regions perceive and respond to survey measures. This article examines the concept of cross-national measurement equivalence in public administration research and explores methods for establishing equivalence. Two methodologies are examined that test and correct for measurement nonequivalence: multiple-group confirmatory factor analysis and multilevel mixture item response theory. These techniques are used to test and establish the cross-national measurement equivalence of two popular measurement constructs: citizen satisfaction with public services and trust in public institutions. Results show that appropriately dealing with nonequivalence accounts for different forms of biases that otherwise would be undetected. The article contributes to the methodological advancement in studying public administration beyond domestic borders.
Bart Meuleman is assistant professor in the Centre for Sociological Research, University of Leuven, Belgium, where he teaches research methodology. His main research interests involve cross-cultural comparisons of attitude and value patterns, such as welfare attitudes, ethnocentrism, religiosity, and basic human values. His research has appeared in Annual Review of Sociology, Public Opinion Quarterly, Journal of Cross-Cultural Psychology, and Journal of European Social Policy.
E-mail: bart.meuleman@soc.kuleuven.be

Steven Van de Walle is professor of comparative public administration and management at Erasmus University Rotterdam, the Netherlands. His research focuses on interactions between citizens and public services, trust, and public sector reform. He was coordinator of the large-scale COCOPS project (2010–14), a European research project on public sector reform in which 11 European universities collaborated.
E-mail: vandewalle@fsw.eur.nl

Public Administration Review, Vol. 75, Iss. 1 (January/February 2015), pp. 36–48. © 2014 by The American Society for Public Administration. DOI: 10.1111/puar.12318.

Consider the following survey item: "Overall, how satisfied are you with your electricity supplier? Please give me a score from 0 to 10, where 0 means that you are not satisfied at all, and 10 means that you are fully satisfied." This is one out of a battery of items that taps citizen satisfaction with public services across a wide range of countries. The underlying assumption of asking respondents in different national populations the same questions is that their answers are supposed to be comparable. In other words, it is assumed that perceptions of what satisfaction means and the way in which people use assigned scales are equivalent across countries, allowing for meaningful comparisons. But is the general notion of what a satisfactory public service is really equivalent across countries, regions, (groups of) individuals, or even over time? And are patterns of response styles the same across different cultures? In this article, we introduce two major techniques for detecting and correcting for nonequivalence in the field of public administration, and we show how these methods can be implemented in applied research.

Comparisons across countries of public administrations, public managers, and interactions and attitudes of citizens toward government are gaining ground in public administration research (e.g., Jilke 2014; Kim et al. 2013; Pollitt and Bouckaert 2011; Van Ryzin 2011). This is accompanied by an increase in the availability of cross-national surveys that contain questions that are relevant for public administration research, such as the International Social Survey Programme, the Eurobarometer, the COCOPS (Coordinating for Cohesion in the Public Sector of the Future) survey of public managers, or the COBRA (Comparative Public Organization Data Base for Research and Analysis) survey of government agency executives, among many others. Making use of such cross-national survey data gives us the opportunity to test the geographic range of social theories by assessing them in many different contexts. Moreover, having survey data from numerous countries enables us to investigate various micro-macro relations by utilizing data from the individual and the country level. Such cross-level interactions permit us to look more closely at interesting relationships between context and individuals, allowing us to explicitly test contextual theories.

However, when respondents in different countries regard measurement constructs in different manners or exhibit culturally influenced response patterns, we typically obtain biased survey measures (Poortinga 1989; Van de Vijver and Leung 1997). Practically speaking, the response of a person in country A—say, to the item on satisfaction we used as an example—may have the same scale position as the response of another person in country B, but it could mean something entirely different if the ways in which respondents interpret or respond to it differ substantially. By simply looking at mean levels of survey responses,
[Figure 1 (panels A and B). Latent score plotted against measured score for Group A and Group B.]

directly relates to dysfunctioning at the item level. An item is said to be biased "if respondents with the same standing on the underlying construct (e.g., they are equally intelligent), but who come from different cultures, do not have the same mean score on the item" (Van de Vijver 2003, 148). Common sources of item bias are poor translations and/or ambiguous items, cultural differences in the connotative meaning of item content, or the influence of culturally specific nuisance factors such as the involvement of socially desirable answering behavior toward specific items.

These types of biases are linked to different forms of measurement nonequivalence. In order to relate bias to measurement nonequivalence, we draw on the generalized latent variable framework (Skrondal and Rabe-Hesketh 2004). Here, it is commonly assumed that theoretical concepts (latent traits), such as trust or satisfaction, are not directly observable but are inferred from multiple observed manifestations of the latent trait (Bollen 2002; Davidov et al. 2014).
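The definition of item bias quoted above can be made concrete with a small simulation: two groups share exactly the same distribution on the latent trait, yet one item carries a group-specific intercept shift, so its observed means differ. The sketch below is illustrative only (NumPy, made-up parameter values), not an analysis from this article:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Same latent trait distribution in both groups (equal "true" satisfaction).
xi_a = rng.normal(0.0, 1.0, n)
xi_b = rng.normal(0.0, 1.0, n)

# Item 1 is unbiased; item 2 carries an intercept shift of +0.5 in group B
# (e.g., a translation that makes agreement easier there).
item1_a = 0.8 * xi_a + rng.normal(0.0, 0.5, n)
item1_b = 0.8 * xi_b + rng.normal(0.0, 0.5, n)
item2_a = 0.8 * xi_a + rng.normal(0.0, 0.5, n)
item2_b = 0.5 + 0.8 * xi_b + rng.normal(0.0, 0.5, n)

# Although the groups do not differ on the latent trait, the biased item
# shows a spurious mean difference of about 0.5 scale points.
print(round(item1_b.mean() - item1_a.mean(), 2))
print(round(item2_b.mean() - item2_a.mean(), 2))
```

Comparing observed means on item 2 would wrongly suggest that group B is "more satisfied," which is exactly the kind of artifact equivalence testing is designed to catch.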
The next form of equivalence, scalar equivalence, suggests that the latent variable has the same scale origin across countries, in addition to being measured using the same metric. Scalar equivalence is required when one needs to compare means across different units (see Meredith 1993). This type of equivalence refers to the equality of intercepts across groups and is affected by method and item bias. If scalar equivalence holds, this shows that respondents across groups not only share the same scale metrics but also the same scale origin. This means that they have the same score on the latent and the observed variables. This can be illustrated by looking at figure 1C, which now depicts an identical line for both groups; note that the steepness of the slopes can no longer vary. This means that we can now compare regression coefficients, covariances, and latent means across groups, allowing us to conduct substantive cross-national analyses.

How to Detect and Deal with Measurement Nonequivalence?
Operationalizing the concept of measurement equivalence, we introduce two techniques to the field of public administration to detect and deal with measurement nonequivalence in comparative research: (1) multiple-group confirmatory factor analysis and (2) multilevel mixture item response theory modeling. In the past, both approaches have enjoyed wide popularity when it comes to testing for measurement equivalence. According to Kankaraš, Vermunt, and Moors (2011), while differences between the techniques lie mainly in the terminology, model assumptions, and procedures for testing for measurement equivalence, they also share a great deal of conceptual similarities, as both can be easily summarized within a generalized latent variable framework (Skrondal and Rabe-Hesketh 2004). But while MGCFA is most appropriate for continuous data,4 IRT is specifically designed to deal with data that are of ordered categorical nature.

MGCFA primarily aims at testing the equivalence of individual items and subsequently establishes different levels of measurement equivalence, including nonequivalence and partial equivalence, in an iterative process.5 The multilevel mixture IRT model with item bias effects that is applied in the later part of this study tests and corrects for measurement nonequivalence within a single model. Both models can be easily extended to include covariates (see, e.g., Davidov et al. 2008; Stegmueller 2011).

Multiple-Group Confirmatory Factor Analysis
The standard single-group confirmatory factor analysis is designed to test a measurement model, where observed responses to a set of items are denoted as χi (where i = 1, . . . , I) and written as linear functions of the latent construct ξ that they measure (e.g., satisfaction). The model also typically includes an intercept (τi) and an error term (δi) for each item, which can be written as follows:

χ = τ + Λξ + δ. (1)

The described factor analytical model has been extended by Jöreskog (1971) to a multiple-group setting. In this MGCFA, the same factor structure is specified for each group k (where k = 1, . . . , K) simultaneously, yielding an overall model fit. Thus, we get

χk = τk + Λkξk + δk, (2)

where Λk stands for a matrix of factor loadings, meaning that it contains one value for each combination of items and the latent construct for every country. The remaining letters are vectors containing the same values as in equation (1), but with a single parameter for each group unit. Within such a framework, we can assess measurement equivalence by comparing parameter estimates across different countries. In our empirical examples, the groups are inhabitants of different countries, but one may also think of comparing different subnational, socioeducational, or professional groups or even looking at the same groups of respondents over time. Regarding the sample size required to perform a confirmatory factor analysis, Kline (2013, 179–80) recommends a ratio of at least 20 respondents per model parameter (see also Jackson 2003), with the overall sample size preferred to exceed N = 200. In the context of an MGCFA, that would mean that researchers would need at least 20 respondents per parameter, per group. But in cases in which no maximum likelihood estimators are employed or items are non-normally distributed, much larger samples are needed.

Assessing different forms of measurement equivalence.6 As we mentioned earlier, three major—hierarchically ordered—forms of measurement equivalence are commonly differentiated: configural, metric, and scalar equivalence (Steenkamp and Baumgartner 1998). Following an iterative process in testing for the different forms of equivalence, Meuleman and Billiet (2012) propose a bottom-up strategy (see also Steenkamp and Baumgartner 1998). This means starting with the lowest level of equivalence—that is, the configural model—and then stepwise testing the next hierarchical levels, first metric and then scalar equivalence.

Practically speaking, configural equivalence means that a measurement model exhibits the same patterns of salient and nonsalient factor loadings across groups (see Horn and McArdle 1992).7 It can be assessed by running an exploratory factor analysis for each country separately and subsequently comparing the number of factors on which items loaded and their parameter estimates. Furthermore, one may estimate an MGCFA without constraints across groups and check whether fit indices are within an acceptable range. If configural equivalence has been established, on this basis, full metric equivalence is tested by constraining the factor loadings in the measurement model to be equal across groups. Formally, this would mean that

Λ1 = Λ2 = . . . = ΛK. (3)
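The logic of equations (1) and (2) can be sketched numerically: data for two groups are generated from the same factor model, but one loading is allowed to differ in group B, violating metric equivalence for that item. Because the latent variance is fixed at 1 here, the covariance between each item and the latent trait recovers its loading, so comparing these estimates across groups flags the nonequivalent item. All parameter values are illustrative assumptions, and the covariance shortcut merely stands in for a full MGCFA estimation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

def simulate(loadings, intercepts, n):
    """Generate items from equation (1)/(2): x = tau + Lambda * xi + delta."""
    xi = rng.normal(0.0, 1.0, n)
    items = [t + l * xi + rng.normal(0.0, 0.5, n)
             for l, t in zip(loadings, intercepts)]
    return xi, items

# Groups A and B share intercepts, but the loading of item 3 differs in
# group B, a violation of metric equivalence for that item.
xi_a, items_a = simulate([0.9, 0.8, 0.7], [5.0, 5.0, 5.0], n)
xi_b, items_b = simulate([0.9, 0.8, 0.4], [5.0, 5.0, 5.0], n)

# With var(xi) = 1, cov(item, xi) recovers each loading; comparing the
# estimates across groups reveals the item with a deviating slope.
load_a = [np.cov(x, xi_a)[0, 1] for x in items_a]
load_b = [np.cov(x, xi_b)[0, 1] for x in items_b]
print(np.round(load_a, 2))
print(np.round(load_b, 2))
```

In applied work the latent trait is of course unobserved and the loadings are estimated jointly (e.g., in Mplus), but the diagnostic idea is the same: group-specific slopes signal metric nonequivalence.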
Thus, metric equivalence can be assessed by comparing multiple measurement models with constrained and unconstrained factor loadings across groups. Moreover, by determining which items' slopes are not equivalent across countries, scholars are able to identify nonequivalent survey items.

The lower levels of equivalence, configural and metric, serve as a prerequisite for establishing the next-stronger level of equivalence: scalar equivalence. It is tested by additionally constraining all intercepts to be equal across countries (see Meredith 1993) and can be written as follows:

τ1 = τ2 = . . . = τK. (4)

However, the described forms of equivalence may not always hold to the full extent. When this is the case, Byrne, Shavelson, and Muthén (1989) propose the concept of partial equivalence. Basically, partial equivalence requires that at least two parameters per country are equivalent, while the others are free to vary. In other words, as long as we have two items with invariant slopes across countries, we can establish partial metric equivalence. Moreover, if we find two items with equivalent slopes and intercepts, we can establish partial scalar equivalence. The basic idea behind this approach is that we need one item, the referent, to identify the scale of the latent variable and one item to determine the metric of the scale used. In practice, this would mean that we can release noninvariant parameters for some items, so long as we have two calibrating items left that are equivalent across units (see also Steenkamp and Baumgartner 1998).

Determining a significant and substantial change in model fit. When testing for different levels of measurement equivalence, the evaluation of model fit is of particular interest to researchers who want to determine whether releasing (or constraining) one additional parameter substantially changes model fit. The evaluation of model fit is typically based on the chi-square test (Kline 2011). However, in larger samples (more than 300 respondents), chi-square is known to be overly sensitive, meaning that it reaches statistical significance for very trivial model changes (Kline 2011, 201). Thus, various authors have recommended using alternative goodness-of-fit measures, such as the root mean square error of approximation (RMSEA) or the comparative fit index (CFI), among many others (Chen 2007; Williams, Vandenberg, and Edwards 2009). However, while those alternative fit measures do not possess the same problems of sensitivity to large sample sizes as chi-square, they have another problem: they do not have known sampling distributions. This makes it extremely difficult to determine an acceptable cutoff value for a statistically significant change in model fit when evaluating equivalence hypotheses (see Meuleman 2012). Moreover, simulation studies have produced very different results when it comes to establishing such cutoff values. For example, Chen (2007) determined cutoff points for global fit indices. However, in a more recent simulation study, Hox and colleagues conclude that the "reliance on global fit indices is misleading when measurement equivalence is tested" (2012, 95; see also Saris, Satorra, and Van der Veld 2009 for similar conclusions).

In line with other authors, Hox and colleagues (2012) recommend using more specific indicators of lack of fit, such as expected parameter changes, in combination with their respective modification indices (Meuleman 2012; Saris, Satorra, and Sörbom 1987; Saris, Satorra, and Van der Veld 2009; Steenkamp and Baumgartner 1998; Whittaker 2012; see also Oberski 2014). By doing so, researchers not only would avoid overfitting and a rather data-driven approach but also would be put in a position to determine a statistically significant and substantial change in model fit. In line with this reasoning, Meuleman and Billiet (2012) recommend the following procedure to determine a significant and substantial improvement (or deterioration) of fit when assessing measurement equivalence. First, one needs to determine the slope (or intercept) with the highest modification index (MI) score—which reports the change in χ2 when freeing the respective parameter. If this MI is strongly significant8 and the associated standardized (STDYX) expected parameter change is of substantive magnitude, the parameter will be relaxed.

Item Response Theory Multilevel Mixture Model with Item Bias Effects
While the use of MGCFA to detect measurement nonequivalence is often perceived as the predominant approach in cross-national research, modern IRT modeling offers similar advantages, with the particular difference that IRT techniques are specifically developed to deal with items that are discrete or ordered categorical. For ordered categorical items, such as Likert scales, the relevant specification is the so-called graded response model (Samejima 1969). It models items' C – 1 thresholds (where c is the item category with c = 1, . . . , C), which are mapped on an unobserved continuous latent response variable and, more importantly, represent transitions from one category to another (commonly referred to as item difficulty). For example, consider an item that probes for citizen trust in government with three answer categories. The two thresholds between categories determine the difficulty of moving from one category to another. If we have similar respondents in two countries with the same position on the latent trait of trust but different thresholds between item categories, then cross-national bias in response behavior is present.

Within this framework, we define an item response model for each item: individual responses j (where j = 1, . . . , J) for choosing category c are predicted using the cumulative probability νijkc for each item i (where i = 1, . . . , I) of a given respondent living in country k (where k = 1, . . . , K). Thus, it is a function of C – 1 thresholds τic (item difficulty) and the latent variable ξjk (that is, the underlying latent trait we are actually measuring, for example, "trust in public institutions"), with the strength of the relationship between item and latent variable (the so-called discrimination parameter or item loading) expressed in the model's coefficients λi (see Stegmueller 2011). In other words, individuals' probability of choosing a higher item category is expressed as a result of their stronger "trust" minus item difficulty. Formally, it can be expressed as follows:

νijkc = logit−1(λiξjk − τic). (5)

The graded response model can be "estimated with 250 respondents, but around 500 are recommended for accurate parameter estimates [when using a five point Likert scale]" (Reeve and Fayers 2005, 70). However, scholars also need to be aware of the respondent-to-parameter ratio; latent traits with many items require more respondents than short scales.

This conventional graded response model has been extended by Stegmueller (2011) to a multilevel mixture IRT model with item
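The mechanics of equation (5) can be sketched directly in code: under a logistic link, the cumulative probability of answering in category c or above is a function of the latent trait (weighted by the loading) minus the threshold, and single-category probabilities follow as differences of adjacent cumulative probabilities. The parameter values below are illustrative assumptions, not estimates from this article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def category_probs(xi, lam, thresholds):
    """Graded response model, equation (5): P(y >= c) = sigmoid(lam*xi - tau_c).
    Category probabilities are differences of adjacent cumulative probabilities."""
    cum = np.concatenate(([1.0], sigmoid(lam * xi - np.asarray(thresholds)), [0.0]))
    return cum[:-1] - cum[1:]

# A three-category trust item has two thresholds (item difficulties).
thresholds = [-1.0, 1.5]
p_low_trust = category_probs(xi=-1.0, lam=1.2, thresholds=thresholds)
p_high_trust = category_probs(xi=1.5, lam=1.2, thresholds=thresholds)

print(np.round(p_low_trust, 3))   # probability mass in the lowest category
print(np.round(p_high_trust, 3))  # mass shifts toward the highest category
```

Shifting a threshold for one country while holding ξ fixed changes these probabilities, which is precisely the item bias the multilevel mixture extension is designed to capture.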
bias effects. Item bias (denoted as δik) is expressed when item thresholds that are associated with the same score on the latent variable vary across countries. This would mean that crossing a certain category for similar respondents is more difficult in country A than in country B. If this is the case, items are not equivalent across countries. Here, instead of testing and subsequently establishing (partial) equivalence (as one would do within an MGCFA framework), this approach corrects for measurement nonequivalence by explicitly modeling it. This is done by introducing discrete random effects for individual items to vary across mixtures m (where m = 1, . . . , M)—these are groups, or, more precisely, latent classes, of countries that share unobserved heterogeneity in country item bias (denoted as ηkm).9 In such a model, item bias is allowed to vary across country mixtures that share unobserved heterogeneity in systematic response behavior. In other words, by introducing direct effects of these mixtures on items, we are able to explicitly model cross-national measurement nonequivalence.

Extending the graded response model, one has to make some changes in notation by first adding subscripts to equation (5), denoting the level of each parameter, with (1) items being nested in (2) individuals (where the latent concept of "trust" is located), nested in (3) countries (where the unobserved heterogeneity in country item bias is located). This yields a three-level model in which we then also subtract the unobserved country item bias that varies across mixtures (see Stegmueller 2011). Thus, we get an unbiased cumulative response probability by specifying

νijkc = logit−1(λiξjk − τic − δik). (6)

When estimating this model, first the number of mixtures needs to be determined. This means that we need to figure out how many latent groups there are across countries that share common characteristics in systematic country item bias. Hence, the model from equation (6) should be estimated with an increasing number of mixtures. In a next step, scholars are able to compare fit measures (e.g., AICC, BIC, log-likelihood) of the different models to determine how many mixtures best fit their data.

In such a framework, one can test for systematic country item bias by checking whether the estimates of item bias effects, δi(1), of single mixtures are significantly different from zero.10 If this is the case, we have strong evidence for the measurement nonequivalence of our items. In other words, this would mean that there exists systematic country item bias in response probability stemming from nonrandom threshold shifts across countries (see Stegmueller 2011). Ignoring those differences would yield potentially biased estimates. Furthermore, this model specification allows us to add covariates to the model in equation (6) and subsequently to estimate the "true effects" of our independent variables of interest. Thus, the IRT approach has the distinct advantage that it puts cross-national researchers in the position to explicitly correct for measurement nonequivalence and to estimate cross-national relationships within a single model.

Measurement Nonequivalence in Practice
Having introduced both empirical techniques, we will next apply them to real-life data. Our empirical examples come from cross-national public opinion surveys used within comparative public administration research. The first example is on citizen satisfaction with public services, and the second example uses data on trust in public institutions.

MGCFA and Citizen Satisfaction with Public Services
Recent years have seen an increasing interest in studying citizens' views and perceptions of public organizations. At the front line of this development is the examination of citizen satisfaction with public services, including its interrelation with individual expectations (James 2009; Morgeson 2013; Van Ryzin 2006, 2013), linkage with objective assessments of performance (Charbonneau and Van Ryzin 2012; Favero and Meier 2013; Shingler et al. 2008), and propensity to facilitate citizen trust in government (Kampen, Van de Walle, and Bouckaert 2006; Vigoda-Gadot 2007). Methodological considerations in measuring citizen satisfaction with public services have also gathered pace (Herian and Tomkins 2012; Van de Walle and Van Ryzin 2011). Thus, it can be seen that the study of citizen satisfaction with public services is of key interest to public administration scholars. A desirable next step would be the cross-national examination of theories of satisfaction in order to see whether they apply to different national contexts. Furthermore, linking individual data on citizen satisfaction with national or regional macro-level characteristics (such as the mode of delivery) would yield interesting findings regarding micro-macro relationships. In pursuing such a research agenda, however, we first need to test whether citizen satisfaction indeed exhibits cross-national measurement equivalence.

Data. We use data from the European Consumer Satisfaction Survey (ECSS). Implemented on behalf of the European Commission, the ECSS was fielded in 2006. It covers all EU-25 member countries11 and a total of 11 public services; thus, it is one of the most comprehensive surveys on citizen satisfaction in Europe. Employing country stratifications according to region, degree of urbanization, gender, age, and education, the ECSS makes use of a representative random sample for each service sector, with a minimum of 500 respondents per sector and per country. For our example, we use data from the electricity sector.

Here, service users were asked to indicate their level of satisfaction within this particular public service sector. More precisely, they were asked four questions tapping into their general level of satisfaction with electricity services:

1. Overall satisfaction (Sat Q1): "Overall, to what extent are you satisfied with [supplier name]? Please give me a score from 1 to 10 where 1 means that you are not satisfied at all, and 10 means that you are fully satisfied."
2. Confirmation of expectations (Exp Q2): "If you compare what you expect from an electricity supplier and what you get from [supplier name], to what extent would you say that your requirements are met? Please give me a score from 1 to 10 where 1 means that your expectations are not met at all, and 10 means that your expectations are not only met but even exceeded."
3. Satisfaction with service quality (Qual Q3): "I will read out a number of statements and would like you to give me, for each of them, a score where 1 means that you totally disagree and 10 means that you totally agree: [supplier name] offers high quality services, overall."
4. Price satisfaction (Price Q4): “I will read out a number indices. More important, the change in chi-square and standardized
of statements and would like you to give me, for each of expected parameter change is displayed (STDYX EPC).
them, a score where 1 means that you totally disagree and
10 means that you totally agree: Overall, [supplier name]’s We start by assessing the configural equivalence of our measurement
prices are fair, given the services provided.” model, which means testing whether it has the same factorial struc-
ture within each country. We were able to establish the equivalence
Assessing cross-national measurement equivalence. For our study of our factor structure for all of the 25 countries under study. This
of citizen satisfaction with electricity services, we first need to spec- means that within each country, all four items loaded significantly
ify the model’s factor structure. All four items should tap the latent on a single factor. Moreover, fit indices of the multiple-group meas-
construct of citizen satisfaction with electricity services. The first urement model indicated that it fits the data well (see table 1, model
two items are quite similar, which is evident from their strong cor- 0). Next we assessed the model’s metric and scalar equivalence. The
relation (r = 0.803, p < .000). Thus, we allow for covariance between full metric model fits the data well, but it still can be improved sub-
them. This can also be theoretically justified, as both items directly stantially by releasing three constrained slopes (factor loadings). We
probe for citizens’ general satisfaction. Moreover, model assess- were not able to establish full metric equivalence, as we found three
ments of individual countries without the covariance between them countries with invariant factor loadings. However, by freeing the
indicated that the model(s) would be significantly and substantially factor loadings for items Q3 and Q4, we can establish partial metric
improved by allowing a correlation between the items. This brings equivalence for all 25 countries. We can now meaningfully compare
us to the measurement model, depicted in figure 2. The figure also parameter estimates across all countries.
shows the factor loadings from the configural equivalent MGCFA
model (highest and lowest country value). The model exhibits good The next level of equivalence, full scalar, is much more difficult
measurement properties: all loadings are significantly different from to satisfy. As depicted in table 1, the full scalar model fits the data
zero and load sufficiently strongly on the latent trait of satisfaction. badly (model 6). However, it can be improved substantially by
releasing 18 intercepts. After this, there were no further possibili-
We test the measurement equivalence of citizen satisfaction with ties left for improving model fit. As we can see from table 1, our
electricity services by using MGCFA. The measurement models final model displays an acceptable fit (model 24), with no fit index
were estimated using Mplus 6. We used a maximum likelihood beyond what is generally considered to be an acceptable cutoff
robust estimator, which accounts for the non-normality of our items (Muthén and Muthén 2010, 533). Furthermore, we employed an estimation procedure that makes use of full information maximum likelihood (FIML). FIML accounts for item nonresponse by taking all available data points into the estimation procedure, regardless of whether there are missing cases (see also Little and Rubin 2002). In our case, item nonresponse was slightly above 5 percent.

For our analyses, we first determined the reference item to identify the scale of the latent variable. This choice was not made arbitrarily but was based on a procedure that sets the latent variable's variance to 1 for all countries and uses unstandardized modification index estimates to select the "most invariant item" (Sass 2011), that is, the item with the lowest overall modification index estimates; in our case, item Q1.12 When it comes to the subsequent order of the tests used to assess our models' measurement equivalence, we employed a bottom-up strategy. This is exemplified in table 1, where the iterative process for equivalence testing is displayed. It shows the respective model's fit to the data using the Satorra-Bentler scaled chi-square, the model's degrees of freedom, and the RMSEA and CFI fit indices.

value. However, we are still not able to compare means across countries; in order to make meaningful comparisons, we would need at least two items with the same invariant slopes and intercepts across countries (partial scalar equivalence). By freeing slopes and intercepts for items Q2 and Q4, we can now meaningfully compare coefficients and latent country means for 19 countries. Yet this excludes Ireland, Latvia, Lithuania, the Netherlands, Spain, and Sweden, as they all have nonequivalent intercepts for items Q1 and Q3, which suggests that it is particularly in those countries that items Q1 and Q3 function differently.

MGCFA: Does it matter? In order to exemplify the biases that comparative researchers may run into when conducting cross-national analyses, we compare the results of our partial scalar equivalence model with the status quo in comparative research: simply computing a factor score for the measured concept from the pooled country data. We estimated simple country fixed effects linear regression models using (1) factor scores and (2) the scores from our partial scalar equivalent MGCFA model. Figure 3 displays the results (using Austria, the country with the highest satisfaction scores, as a reference).
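The bias that nonequivalent intercepts introduce into mean comparisons can be illustrated with a small simulation. Everything below is hypothetical: the intercepts, loadings, and the 0.5 intercept shift are illustrative numbers, not estimates from this study. Two countries have identical latent satisfaction, yet a shifted intercept on a single item makes a naive composite-score comparison find a difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Equal latent satisfaction in both countries
theta_a = rng.normal(0.0, 1.0, n)
theta_b = rng.normal(0.0, 1.0, n)

loadings = np.array([0.8, 0.7, 0.9, 0.6])     # hypothetical slopes
intercepts = np.array([3.0, 3.2, 2.9, 3.1])   # hypothetical intercepts

def items(theta, intercepts, loadings):
    """Generate four continuous survey items from a latent trait."""
    noise = rng.normal(0.0, 0.5, (theta.size, 4))
    return intercepts + np.outer(theta, loadings) + noise

y_a = items(theta_a, intercepts, loadings)
# Country B: item 1's intercept is shifted (scalar nonequivalence)
y_b = items(theta_b, intercepts + np.array([0.5, 0.0, 0.0, 0.0]), loadings)

naive_gap = y_b.mean() - y_a.mean()                    # contaminated by the shift
invariant_gap = y_b[:, 1:].mean() - y_a[:, 1:].mean()  # invariant items only

print(round(naive_gap, 2), round(invariant_gap, 2))
```

The naive gap is roughly the intercept shift spread over four items (about 0.125), while restricting the comparison to the three invariant items recovers the true zero difference — the same logic that motivates requiring partial scalar equivalence before comparing latent means.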
Model (item released)    SB χ²     df    RMSEA    CFI      MI       Estimate
M2  (Q released)         219.87     96   0.064    0.983    13.27    –0.236
M11 (Q released)         711.41    159   0.093    0.939    44.70    –0.234
M14 (Q released)         598.72    156   0.085    0.951    33.12     0.191
M16 (Q released)         536.48    154   0.080    0.957    28.02     0.237
M17 (Q released)         509.24    153   0.078    0.959    27.25     0.248
M18 (Q released)         481.46    152   0.075    0.962    27.78    –0.211
M19 (Q released)         461.75    151   0.074    0.964    19.71    –0.219
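To relate chi-square statistics to RMSEA values of the kind reported in equivalence tables, a minimal sketch of the standard single-group population approximation follows. The sample size used here is hypothetical, and multigroup software applies rescaled variants of this formula, so these values are illustrative only.

```python
import math

def rmsea(chi_square: float, df: int, n: int) -> float:
    """Population RMSEA approximation from a chi-square statistic.

    Standard single-group formula sqrt(max(chi2 - df, 0) / (df * (n - 1)));
    n is a hypothetical total sample size, and multigroup variants rescale
    this quantity by the number of groups.
    """
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

# A chi-square close to its degrees of freedom yields RMSEA near zero;
# larger misfit per degree of freedom yields a larger RMSEA.
print(round(rmsea(219.87, 96, 5000), 3))
print(round(rmsea(711.41, 159, 5000), 3))
```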
in the United States indicate that, indeed, the conception of trust changes over time, as does people's individual response behavior (Poznyak et al. 2013). In the following, we assess the cross-national measurement properties of citizen trust in public institutions using the IRT approach.

Data. For this part of our study, we use data from the 2005 World Values Survey (WVS). The WVS is a high-quality and well-known cross-national survey, established in 1981. It regularly surveys a representative sample of national populations across a very broad

Model              Components    Log Likelihood    Parameters    BIC       AICC
M2                 4             –31,817           31            63,945    63,914
M3                 5             –31,744           36            63,849    63,813
M4                 6             –31,695           41            63,801    63,760
M5                 7             –31,652           46            63,765    63,719
M6                 8             –31,623           51            63,758    63,707
M7                 9             –31,591           56            63,744    63,688
M8 (final model)   10            –31,553           61            63,717    63,657
M9                 11            –31,545           66            63,752    63,686
44 Public Administration Review • January | February 2015
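Model selection by information criteria of this kind can be sketched as follows. The formulas are the textbook definitions of BIC and Bozdogan's consistent AIC (which may differ from the exact AICC variant used in the article), and the sample size — and therefore the computed values — is hypothetical rather than taken from the study.

```python
import math

def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: -2*LL + k*ln(n)."""
    return -2.0 * loglik + k * math.log(n)

def caic(loglik: float, k: int, n: int) -> float:
    """Consistent AIC (Bozdogan): -2*LL + k*(ln(n) + 1)."""
    return -2.0 * loglik + k * (math.log(n) + 1.0)

# Candidate models: (name, log likelihood, free parameters).
# Log likelihoods follow the table; n = 22,000 is a hypothetical sample size.
candidates = [("M2", -31817.0, 31), ("M8", -31553.0, 61), ("M9", -31545.0, 66)]
n = 22_000

best = min(candidates, key=lambda m: bic(m[1], m[2], n))
print(best[0])
```

The penalty term grows with the number of free parameters, so the most complex model (M9) improves the log likelihood but not enough to offset the penalty, and the criterion prefers M8.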
the Bayesian information criterion (BIC) and the consistent Akaike information criterion (AICC).15

In the next step, the properties of our measurement model were examined. Table 3 presents an overview of the actual factor loadings and their accompanying item thresholds. First, we can see that our items exhibit good measurement properties: all items load significantly and strongly on one latent trait, that is, trust in public institutions. Moreover, we can see that the thresholds are clearly spread out across a wide range of our latent variable. Thus, it can be concluded that our items load statistically and substantively significantly on the latent trait and that the thresholds cover a wide range of the latent variable, providing a precise measurement over a great share of the scale of trust in public institutions.

Table 3 Citizen Trust in Public Institutions Measurement Model (Model 8)

Item             Factor Loading λi (1)    SE       Threshold τi1    Threshold τi2    Threshold τi3
Police           1.989*                   0.054    2.777*           –1.434*          –4.601*
Justice system   2.637*                   0.084    3.938*           –0.685*          –4.516*
Government       1.768*                   0.049    4.543*           0.705*           –2.620*
Civil service    1.503*                   0.043    4.370*           0.227*           –3.075*
*p < .05.

Now we turn to analyzing the extent of systematic country item bias on individuals' response behavior. Table 4 reports the coefficients and standard errors of item bias (δik) for each survey item and mixture component. In order to reach model identification, item bias of the first item (trust in the police) was set to zero (see also Stegmueller 2011). From the table, we can clearly see that severe country item bias exists, and it is in the same direction for most countries (except Bulgaria, Finland, and Norway). For all countries, item bias of at least one item is significantly different from zero. This highlights the crucial role that systematic country differences in response probability play in our measure of trust in public institutions. Item bias is strongest in Switzerland, West Germany, and Canada. Looking at effect directions, we have to bear in mind that the WVS survey items measure trust in a reverse manner: a low value indicates a high level of trust, while a high value indicates a low level of trust. Thus, we can see that respondents in Switzerland, for example, systematically overreport their trust in public institutions, while people living in the western part of Germany underreport their levels of trust. If researchers simply compare responses from these countries without correcting for country item bias, they will systematically over- or underestimate people's trust in public institutions.

IRT: Does it matter? To exemplify the systematic biases that comparative scholars may encounter when analyzing cross-national data, we compared the results from our IRT model with the standard approach in the discipline, which is simply computing factor scores from pooled country data. Figure 4 reports the coefficients and accompanying 95 percent confidence intervals from linear regression models with country fixed effects. Norway, the country with the highest levels of trust, is used as the reference category. From the figure, we can clearly see that simply ignoring country item bias in response probability can lead to misleading results. For example, when we look at the factor score coefficients for Switzerland, we may conclude that Switzerland is not significantly different from Norway. But when looking at the coefficients from the IRT approach used in this study, we see that individuals living in Switzerland trust their public institutions significantly less than people living in Norway. The difference between both coefficients is the result of systematic country item bias in individuals' item response probability.

Figure 4 Country Fixed Effects and 95% Confidence Intervals, Trust in Public Institutions (factor score vs. IRT estimates; countries shown: RO, PL, SI, BG, DE-E, IT, ES, US, DE-W, CA, SE, CH, NO)
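The mechanism by which country item bias (δik) distorts observed responses can be sketched with a graded-response-type item: the bias shifts the item's thresholds for one country, changing the probability of each response category even when the latent trust level is identical. The loading, thresholds, and bias value below are hypothetical, not the model's estimates.

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def cat_probs(theta: float, loading: float, thresholds: list, bias: float = 0.0):
    """Category probabilities for a graded-response-type item.

    Cumulative form: P(y >= c) = logistic(loading*theta - (threshold_c + bias)).
    A country item bias shifts all of this item's thresholds for that country.
    All parameter values used here are hypothetical.
    """
    cum = [1.0] + [logistic(loading * theta - (t + bias)) for t in thresholds] + [0.0]
    return [cum[c] - cum[c + 1] for c in range(len(thresholds) + 1)]

# Same latent trust (theta = 0), same item, with and without country item bias
p_ref = cat_probs(0.0, 2.0, [-1.5, 0.0, 1.5])
p_biased = cat_probs(0.0, 2.0, [-1.5, 0.0, 1.5], bias=0.8)
print([round(p, 3) for p in p_ref])
print([round(p, 3) for p in p_biased])
```

With a positive bias, the same respondent becomes less likely to endorse the highest category, which is exactly why pooling raw responses across countries with nonzero δik over- or underestimates trust.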