
European Journal of Psychological Assessment, Vol. 11, Issue 3, pp. 147–157
© 1995 Hogrefe & Huber Publishers


Increasing the Validity of Cross-Cultural Assessments:
Use of Improved Methods for Test Adaptations*

Ronald K. Hambleton and Anil Kanjee
University of Massachusetts at Amherst, U.S.A.

Keywords: Test translations

Translating or adapting psychological and educational tests from one language and culture to other languages and cultures has been a common practice for almost a hundred years, beginning with Binet's test of intelligence. Despite the long history and the many good reasons for adapting tests, proper methods for conducting test adaptations and establishing score equivalence are not well known by psychologists. The purpose of this paper is to focus attention on judgmental and statistical methods and procedures for adapting tests, with particular attention to procedures for identifying poorly adapted items. When these methods are correctly applied, the validity of any cross-cultural uses of the adapted test should be increased.

Introduction

There are three excellent reasons for translating or adapting psychological and educational tests for use in additional languages and cultures:
1. To enhance fairness in assessment by allowing persons to be assessed in the language of their choice. For example, in a recent study to measure verbal and non-verbal abilities in South Africa, students were given the choice to take a test in any one of five languages (A. R. van den Berg, personal communication, June, 1993). Possible score bias associated with assessing students in their second or third best language was removed, and the validity of the test results could be expected to increase.
2. To facilitate comparative studies across national, ethnic, and cultural groups, at both the international and the national level. This is especially relevant currently with the growing contact and cooperation of different nations in the economic, educational, and cultural spheres, which has resulted in an increased need for many nationalities and groups to know and learn more from and about each other. For example, over 40 countries are participating in the Third International Mathematics and Science Study (TIMSS) being conducted in 1995 and 1999. This is the largest international comparative study of school achievement in our history.
3. To reduce costs and save time in developing new tests. It is often cheaper and faster to adapt an existing test into a second language than to develop a new test. This is especially true in situations where there is a lack of (a) resources for assessment development, and (b) assessment expertise.

In addition to these three reasons for adapting tests, there are some less important reasons that are sometimes relevant. Brislin (1986) notes that tests are sometimes adapted to enable the use of existing normative data. Such use can be problematic (are the existing norms relevant for a different population of examinees?), but the construction of a new test would certainly require the collection of additional normative data, which can be expensive and time-consuming.

Another reason for adapting a test is the sense of security that established and respected tests provide users. There may even be copyright laws which would make it impossible to produce a similar test for a second language group. For example, a research team may prefer to adapt an established and well-known test of state-trait anxiety for their purposes rather than construct a new test of state-trait anxiety.

* Laboratory of Psychometric and Evaluative Research Report No. 275. Amherst, MA: University of Massachusetts,
School of Education.
Paper presented at the 23rd International Congress of Applied Psychology, Madrid, 1994.

However, as Brislin (1986) cautions, using an adapted version of a test merely because it is well established in one language group does not remove the necessity for reliability and validity studies in the second language group. This point is also made clearly in the new International Test Commission (ITC) Guidelines for Adapting Educational and Psychological Tests (Hambleton, 1994).

While the use of adapted tests has been common in education and psychology from the time of Binet's first intelligence test in 1905, it must be recognized that the field of cross-cultural and cross-national comparisons is still relatively new. Currently, the main methodological concerns revolve around (1) methods and procedures for adapting tests, focusing on establishing the equivalence of scores (Drasgow & Hulin, 1986; Ellis, 1991; Hambleton, 1993; Poortinga, 1983; van de Vijver & Poortinga, 1991b), (2) ways of interpreting and using cross-cultural and cross-national data (Hambleton & Kanjee, 1994; Poortinga & Malpass, 1986), and (3) the development and use of guidelines for adapting tests (Hambleton, 1993, 1994).

The purpose of this paper is to describe several judgmental and statistical methods and procedures for adapting tests and to consider several of their strengths and weaknesses. The statistical procedures noted are considered in the context of identifying differential item functioning (DIF) using item response theory (IRT) and logistic regression (LR).

In this paper the term "adapting tests" is preferred to "translating tests" since the former term is more descriptive of what happens in practice (or should happen). Test translation is usually only one step in the process of preparing a test for use in a second language and culture. The useful distinction between test translation and test adaptation is explained in more detail by Hambleton (1994) and was strongly recommended by the 13-person panel responsible for the development of the ITC Guidelines for Adapting Tests.

It must be noted at the outset that DIF detection procedures will inevitably identify test items (broadly defined to include all types of assessment material, ranging from multiple-choice and true-false items to essay questions to personality rating scales) for other reasons besides bias, for example, unfamiliar formats, inappropriate content, and Type I errors (items flagged as "DIF" for reasons as simple as chance factors alone). The point is that DIF studies are statistical studies, and therefore they can only identify items which may contain problems of one kind or another. The matter of identifying the source of the problem or problems in items must be left to carefully designed causal investigations, logical analyses of the findings, and a review of the test. Still, DIF studies are invaluable in the process of test adaptation, and a review of the methods may be helpful since they are not well known by cross-cultural researchers.

Equivalence and DIF in Cross-Cultural/Language Comparisons

The attainment of equivalent measures in two languages is perhaps the most central issue in cross-cultural/national comparative research (Poortinga, 1983). If the basis of comparison is not equivalent across different groups, then valid comparisons across these groups cannot be made. Certainly, observed scores from groups taking different tests are on different scales and are thus not directly comparable (Drasgow & Kanfer, 1985; Lonner, 1990). For any comparison between different language/cultural groups to be valid, all tests used must be demonstrated to be equivalent.

Hulin (1987) provides a useful definition of equivalence:

If individuals with the same amounts of the trait being estimated have different probabilities of making a specified response to the item when responding to different language versions of the items or scales, the items are said to be biased or nonequivalent (p. 138).

That is, individuals with the same standing on a construct, say math ability, but belonging to different groups, say Brazilians and Nigerians, should have the same expected observed score on the item measuring that construct. Figure 1 shows a situation where no DIF is present; in Figure 2, DIF is present.

Defined within the framework of differential item functioning (DIF), two or more versions of an item (or statements from a personality questionnaire) prepared in different languages are assumed to be equivalent when members of each group with the same score (or about the same score) on the construct measured by the test have the same chance of selecting the correct answer on an ability item or making the same selection on a personality questionnaire (Hambleton, Swaminathan, & Rogers, 1991). It must be noted that there is no requirement for equal distributions on the construct being measured across the different groups (Drasgow & Kanfer, 1985). In fact, it is precisely the study of differences in the two groups that is often of interest in the research.

Thus, within any cross-cultural/national comparison, it is possible for some groups to display higher or lower scores than other groups. In this context, it is important to ensure that any score differences manifested between these groups are not due to the failure of the test to provide equivalent scores, and thus the identification and elimination of any DIF is crucial.

It is important to note that the identification and elimination of DIF from any test increases the reliability and validity of scores. Thus, results from two or more groups from which DIF items have been removed are more likely to be comparable and thus equivalent. One way of increasing this likelihood is to ensure that appropriate methods and procedures are applied whenever two or more groups from different language and/or cultural backgrounds are compared. Some popular judgmental and statistical methods for assessing equivalence are presented next.

[Figure 1. An item showing no DIF in two language groups. Probability of a correct response plotted against ability (-3 to +3).]

[Figure 2. Item characteristic curves showing uniform DIF between target and source language groups. Probability of a correct response plotted against ability (-3 to +3) for the reference and target groups.]
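To make the patterns in Figures 1 and 2 concrete, the following minimal sketch (in Python, with invented item parameters used only for illustration) computes two-parameter logistic item characteristic curves for a reference and a target language group. Identical parameters produce the coincident curves of Figure 1; shifting the item difficulty for the target group produces the uniform DIF pattern of Figure 2.

```python
# A minimal sketch: item characteristic curves (ICCs) for a hypothetical item
# under a two-parameter logistic (2PL) model. All parameter values below are
# invented for illustration only.
import numpy as np

def icc(theta, a, b):
    """Probability of a correct response at ability theta under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-3, 3, 7)          # ability scale, as in Figures 1 and 2

# If both language versions share the same parameters, the curves coincide
# (no DIF, Figure 1). Shifting the difficulty for the target group produces
# the uniform DIF pattern of Figure 2.
p_reference = icc(theta, a=1.2, b=0.0)
p_target = icc(theta, a=1.2, b=0.8)    # item is uniformly harder in the target group

for t, pr, pt in zip(theta, p_reference, p_target):
    print(f"ability {t:+.1f}: reference {pr:.2f}, target {pt:.2f}")
```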
Judgmental Designs for Assessing Equivalence

Judgmental designs for establishing test item equivalence are popular in practice (Hambleton & Bollwark, 1991). Two designs are common: (1) forward-translation designs and (2) back-translation designs, and each will be considered in detail. First, the topic of selecting translators is considered.

Selection and Training of Translators

The importance of obtaining the services of competent translators should be obvious. Too often, though, researchers have tried to go through the translation process with a single translator selected because he or she happened to be available. A "friend" or the "wife of a colleague who speaks the language" are descriptions commonly found in the methodology sections of cross-cultural research studies. Competent translation work cannot be assumed. Also, the use of a single translator, no matter how competent and conscientious, does not permit valuable discussions of independent translations across a group of translators.

But translators should be more than persons familiar and competent with the languages involved in the translation. They should know the cultures very well, especially the target culture (i.e., the culture of the language into which the test is being translated). Knowledge of the cultures involved, especially the target culture, is often essential for an effective translation. Also, subject matter knowledge is essential when adapting achievement and aptitude tests. The nuances and subtleties of a subject area will be lost on a translator unfamiliar with the subject matter. Too often, translators without technical knowledge will resort to literal translations which are problematic for target language examinees and threaten test validity.

Finally, test translators would benefit from some training in test construction, since it is unlikely that test translators can be found with language competence, cultural knowledge, subject matter expertise, and knowledge of test construction. Knowledge of test construction, especially item writing, will reduce the chances of test translators creating flaws in items which might change item difficulty or attractiveness. For example, test translators need to know that, when doing translations, they should not create clang associations that might lead test-wise examinees to the correct answers, or unknowingly translate distractors in multiple-choice items so that they have the same meaning.

A test translator without any knowledge of the principles of test and scale construction could easily make test material more or less difficult unknowingly and correspondingly lower the validity of the test in the target population. In one rather infamous example, a translator, in an effort to ensure that each translated item had a definite correct answer, embellished the correct answers by making them longer and more informative. Examinees quickly recognized that the longest answer choices were the correct ones, and the target language test was no longer equivalent in difficulty to the source language test.

Forward-Adaptation Designs

In this design, the source version of a test is first translated into the target language by several translators working individually or in small groups working independently (Hambleton, 1994). Then, the translators or small groups come together and work out a combined translation that best represents the views of the translators. Sometimes a new group of translators may enter the process and organize the available translations into what they believe to be the best single translation.

Hambleton and Bollwark (1991) note that comparisons of multiple translations can be made by having translators simply look the items over, check the characteristics of items against a checklist of item characteristics that may introduce non-equivalence, or attempt to answer both versions of the item before comparing them for errors. This design is excellent because the use of multiple translators ensures that the idiosyncrasies, blind spots, and shortcomings of particular translators do not dominate the process. Three problems with the forward-adaptation design are: (1) it is often difficult to find translators who are equally familiar with the source and target languages and/or cultures, (2) translators may inadvertently use insightful guesses (though this problem is substantially reduced with the use of multiple translators), and (3) translators may not think about the items in the same way (or be able to think about them) as the respective source and target language monolinguals, and thus the results may not be generalizable. Despite these problems, forward-translation designs have considerable merit and must certainly be a part of any test adaptation process. Because of the problems, however, this design would almost never provide sufficient evidence on its own to justify the use of an adapted test.

In one variation of this design, one or more samples of target examinees answer the target version of the test and are then questioned by judges about the meaning of their responses. Judges decide if the responses given reflect a reasonable representation of the item in terms of cultural and linguistic understanding. If a high percentage of examinees present a reasonable representation of an item (in the target language), the item is then regarded as being equivalent to the source language version. The main judgement is whether the target language examinees perceive the meaning of each item on a test in the same way as the source language examinees (Hambleton, 1993). The use of cognitive psychologists may be helpful in judging the equivalence of meaning in the source and target language versions of a test. The advantage of this version of the forward-translation design is that valuable information about the functioning of any item is provided directly by the examinees, information that is otherwise unavailable when examinees only respond to questions on paper. However, the disadvantage is that there are many factors (personal, cultural, linguistic) during the interaction between examinees and judges that can quite easily interfere with the results. For example, judges can easily misinterpret, misunderstand, and/or misrepresent the responses of target examinees. Another disadvantage is that this method is labor-intensive and time-consuming compared to other judgmental methods (Hambleton, 1993). A third disadvantage is that if the test used by source language monolinguals is not valid, or the meaning of responses from examinees is not fully understood, comparing the results to target language monolinguals is meaningless. That is, one has to be certain of the meaning of responses from source language monolinguals before judging responses from target language monolinguals, as the former (subjective) interpretations of the responses of examinees are the basis by which the latter responses are judged.

In another variation of the forward-translation design, bilingual judges (perhaps translators or bilingual cognitive psychologists) study the source and target language versions of a test to assess equivalence. Changes can be made in an adapted test based upon information provided from the judgmental review.

Back-Adaptation Designs

In back-adaptation designs, the source language test is first translated into the target language by several translators, and then translated back into the original language by a different set of translators (Brislin, 1986).

Equivalence is usually assessed by having source language judges check for errors between the original and back-translated versions of the test. The main advantage of this design is that researchers who are not familiar with the target language can examine both versions of the source language test to gain some insight into the quality of the translation (Brislin, 1976). Also, this design can easily be adapted such that a monolingual researcher (an assessment or subject specialist) can evaluate (and thus improve) the quality of the translation after the test has been translated into the target language, but before it is back-translated into the source language.

The main disadvantage of this design is that the evaluation of test equivalence is carried out in the source language only. It is quite possible that the findings in the source language version do not generalize to the target language version of the test. This might happen if the translators use a shared set of translation rules that ensures that the back-translated test is similar to the original test (Hambleton, 1993). Another disadvantage is that the assumption that errors made during the original translation will not be made again during the back-translation is not always applicable (Hambleton & Bollwark, 1991). Often, skilled and experienced translators use "insight" to make translated items appear equivalent, even though this may not be true. This, however, can be controlled by using either a group of bilingual translators or a combination of bilinguals and monolinguals to perform multiple translations to and from the target and source languages (Bracken & Barona, 1991; Brislin, 1986). For example, (1) Brislin (1986) suggested the use of monolinguals to check the translated version of the test and make necessary changes before it is back-translated and compared to the original version; (2) once the two versions of a test are as close as possible, Bracken and Barona (1991) suggested the use of a bilingual committee of judges to compare the original (or back-translated) and the translated versions of the test to ensure that the translation is appropriate for examinees.

Summary

The judgmental methods include the use of (1) forward-adaptation and (2) back-adaptation designs. Both designs can provide researchers with valuable information about the equivalence of source and target language tests. However, the sole use of judgmental designs for assessing equivalence does not provide adequate evidence of equivalence because examinee responses to the tests are not collected and carefully analyzed. The ultimate criterion of test equivalence must come from an analysis of the actual responses of examinees to the tests. Since examinees are often operating at a different cognitive level than translators (and under test-taking conditions), it is highly possible that a translation found to be acceptable by translators may not actually be so in practice. Hambleton (1993) notes that most of the available evidence from item bias review studies suggests that judges are not very successful at predicting the items on a test that function differentially in two or more groups (e.g., males versus females, blacks versus whites). To this end, the suggested practice is that judgmental methods should be supplemented with appropriate statistical methods (Bracken & Barona, 1991; Hambleton, 1993; Prieto, 1992). Statistical methods are considered next.

Statistical Methods for Assessing Equivalence

The statistical methods employed to identify DIF between two (or more) tests in different languages are characterized by (1) the statistical design and (2) the statistical procedures used. The statistical design used depends on the characteristics of the samples of examinees or participants (that is, monolingual or bilingual) and on the version of the adapted test (that is, original, translated, or back-translated) (Hambleton & Bollwark, 1991), while the statistical procedure(s) selected depend on whether a common scale for the two or more adapted versions of the test is assumed and whether conditional or unconditional procedures are applied (van de Vijver & Poortinga, 1991a). These factors determine the specific analytical procedures (factor analysis, item response theory, logistic regression, etc.) which are best suited to identify DIF, and thus a thorough and complete understanding of the currently applicable statistical procedures is important. In the next section, a brief explanation of some of the applicable statistical designs and procedures currently used is presented.

Statistical Designs

The three statistical designs discussed in this section are based on whether the examinees used to assess item equivalence are (1) bilingual examinees, (2) both source and target language monolinguals, or (3) only source language monolinguals.

Bilingual Examinees. In this design, both the source and target versions of the test are administered to bilingual examinees, and the two sets of scores are then compared. Care is taken to ensure that the order of test presentation is counterbalanced and that the time between administrations is short enough that examinee ability scores measured by the test are not likely to change. The advantage of this design is that since the same examinees take both versions of the test, differences in the abilities of examinees that can confound the evaluation of the equivalence of adapted versions of a test will be controlled (Hambleton & Bollwark, 1991). One disadvantage of this design is that, due to time constraints, examinees might not be able to take both versions of the test. A variation of this design that overcomes this problem of time is to split the bilingual sample and randomly assign examinees to only one version of the test. Now, the item and test performance of the randomly equivalent groups can be compared.

The problem of differences between the examinees with respect to their "level" of bilingualism and/or "level" of biculturalism could still violate the assumption of equal abilities between examinees (Hambleton, 1993). Another, even more serious, problem is that the results obtained from bilingual examinees may not be generalizable to the respective source language monolinguals (Hulin, 1987). Also, with regard to cross-national studies, the use of this design is not a feasible option as it is very difficult to find individuals who are equally familiar with the cultures and languages of the nationalities being compared. Language dominance tests are available for the most common languages, and they could be helpful, but concerns about their validity exist too. In any case, language dominance tests exist for only a few language combinations anyway, and so this solution is of only limited value.

Source and Target Language Monolinguals. In this design, source language monolinguals take the source version and target language monolinguals take the target version of a test (Brislin, 1986; Candell & Hulin, 1986; Ellis, 1989, 1991; Hulin & Mayer, 1986). The source version can either be the original or the back-translated version of the test (Brislin, 1986). The two sets of scores are then compared to determine the equivalence between the two versions of the test. The main advantage of this design is that since both source and target language monolinguals take the versions of the test in their respective languages, the results are more generalizable to their respective populations. A major problem in practice is that since two different samples of examinees are compared, the resulting scores may be confounded by real ability differences in the groups compared (Hambleton, 1993). Such confounding complicates the search for poorly adapted assessment material.

However, alternative steps can be taken to minimize this problem (Bollwark, 1991). First, examinees selected for the groups should be matched as closely as possible on the ability/abilities of interest. Matching should be based on criteria that are relevant to the purpose of assessment. For example, scores from tests that assess correlated tasks/abilities could be used. If such information is unavailable, examinee samples should be chosen using the best available information about the ability level of each sample; for example, years and type of schooling and/or demographic data may be used. Second, conditional statistical procedures that take the ability of examinees into account when comparing performance on a test can also be used to control for ability differences in the source and target examinee samples; examples include procedures based on item response theory, the Mantel-Haenszel statistic, and logistic regression. In fact, the use of conditional statistical procedures has become a standard way to control for real group differences when conducting a search for items that are problematic due to poor adaptations and other problems.

Last, factor analysis, or any other statistical procedure in which no common scale is assumed, is often used in conjunction with this design. For example, in factor analysis, scores of the two groups are separately analyzed to determine the similarity of the factor structures across the two groups. However, the disadvantage is that since factor analysis is based on classical item statistics, the factor analytic findings are sample dependent (Hambleton & Bollwark, 1991). The use of representative samples in each population of interest can be helpful. Even with non-representative samples, researchers must check that the ordering of item difficulties is the same in the two versions of the test.
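One simple way to carry out this last check is sketched below (a minimal example in Python; all proportion-correct values are hypothetical). A Spearman rank correlation is used so that only the ordering of the item difficulties in the two monolingual samples matters.

```python
# A minimal sketch: comparing the ordering of classical item difficulties
# (proportion correct) in the source- and target-language samples.
# The values below are hypothetical.
from scipy.stats import spearmanr

p_source = [0.85, 0.72, 0.64, 0.58, 0.41, 0.33]   # proportion correct per item, source sample
p_target = [0.80, 0.70, 0.47, 0.55, 0.38, 0.30]   # proportion correct per item, target sample

rho, pvalue = spearmanr(p_source, p_target)
print(f"Spearman correlation of item difficulty orderings: {rho:.2f} (p = {pvalue:.3f})")
# A low correlation suggests that the relative difficulty of some items changed
# in adaptation; those items deserve a closer judgmental review.
```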
Source Language Monolinguals. In this design, test equivalence is based on the scores of source language monolinguals who take both the original and the back-translated versions of the test. The advantage is that the same sample of examinees is used, and thus scores are not confounded by examinee differences.

A major problem, however, is that no data are collected on the performance of target language examinees or on the translated version of the test. Thus, information about possible problems concerning the target group version is not available, making the usefulness of this design very limited. Still, data from this design can have some value, albeit limited, in the identification of problems in the adaptation of a test. Items identified as functioning differently in the two administrations may be items which need to be checked in the adapted version of the test. The bigger problem with this design is that problematic items in the adapted version of the test go undetected, since the adapted version is not directly considered.

Statistical Procedures to Detect DIF

Statistical procedures based on IRT are considered by many researchers to provide a more theoretically sound approach for the study of DIF (Camilli & Shepard, 1994; Linn & Harnisch, 1981; Shepard, Camilli, & Williams, 1985). In addition, Scheuneman and Bleistein (1989) note that the three-parameter IRT model is preferred over the one- and two-parameter models. (Item model fit is desirable before comparisons across groups are made.) The major advantage of IRT methods with regard to the detection of DIF is the property of population invariance. However, a major disadvantage, especially for the two- and three-parameter models, is that relatively large sample sizes are required for estimating parameters, and such samples are sometimes difficult to attain in practice (Zieky, 1993). Hambleton, Swaminathan, and Rogers (1991) provide the details for the use of IRT in the detection of flawed test items.
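One common way of quantifying DIF within the IRT framework is the area between the item characteristic curves estimated separately in each group (the "IRT area" approach compared with the Mantel-Haenszel method by Hambleton & Rogers, 1989). The sketch below approximates that area numerically; the 2PL parameter estimates are hypothetical.

```python
# A minimal sketch: the unsigned area between the ICCs estimated in the
# source and target groups, a common IRT-based DIF index.
# The parameter estimates are hypothetical.
import numpy as np

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-4, 4, 801)            # dense grid over the ability scale
p_source = icc(theta, a=1.1, b=-0.2)       # estimates from the source-language sample
p_target = icc(theta, a=1.1, b=0.5)        # estimates from the target-language sample

area = np.trapz(np.abs(p_source - p_target), theta)
print(f"Unsigned area between the two ICCs: {area:.3f}")
# Items with comparatively large areas are flagged for review as possible
# casualties of the adaptation process.
```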
A review of the psychometric literature indicates that many alternatives to the IRT-based procedures have been proposed. Alternatives have been sought which are somewhat less labor intensive than IRT and that can detect potentially flawed items with smaller samples. A discussion of all these procedures used to detect DIF or to assess item equivalence is beyond the scope of this paper. Some of the more popular procedures that are currently used include logistic regression (Bennett, Rock, & Kaplan, 1987; Swaminathan & Rogers, 1990), the Mantel-Haenszel procedure (Hambleton & Rogers, 1989; Hambleton, Clauser, Mazor, & Jones, 1993; Holland & Thayer, 1988; Schmitt, Holland, & Dorans, 1993), the standardization procedure (Dorans, 1989; Dorans & Holland, 1993), and factor analytic procedures (Knol & Berger, 1991; Mayberry, 1984; Royce, 1988; Triandis, 1976).

It must be noted, however, that the Mantel-Haenszel (Holland & Thayer, 1988) and the logistic regression (Swaminathan & Rogers, 1990) procedures are perhaps the best known of the many non-IRT-based alternatives proposed to detect DIF (Rogers & Swaminathan, 1994). Compared to IRT procedures, both these procedures are easier to use and understand, are readily available, are applicable to relatively small sample sizes, and are associated with significance tests to aid in interpreting the DIF statistic (Hambleton & Rogers, 1989).
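To indicate what the Mantel-Haenszel calculation involves, a minimal sketch is given below; the score strata and all counts are hypothetical. Examinees are matched on total test score, and a common odds ratio is pooled across the score strata (the ETS delta metric, -2.35 times the log of the odds ratio, is also shown).

```python
# A minimal sketch of the Mantel-Haenszel DIF statistic for one item,
# conditioning on total test score. All counts are hypothetical.
from math import log

# stratum: (ref_correct, ref_incorrect, focal_correct, focal_incorrect),
# where "ref" is the source-language group and "focal" the target-language group.
strata = [
    (20, 40, 15, 45),    # low scorers
    (45, 30, 35, 40),    # lower-middle scorers
    (60, 15, 48, 27),    # upper-middle scorers
    (70,  5, 62, 13),    # high scorers
]

num = den = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    num += a * d / n     # reference correct and focal incorrect
    den += b * c / n     # reference incorrect and focal correct

alpha_mh = num / den     # Mantel-Haenszel common odds ratio
print(f"MH common odds ratio: {alpha_mh:.2f}")
print(f"MH delta (ETS scale): {-2.35 * log(alpha_mh):.2f}")
# An odds ratio well above 1 (negative delta) suggests the item favors the
# reference group even after matching examinees on total score.
```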

Simultaneous Detection of DIF in Multiple Groups

DIF detection techniques have been applied primarily using pairwise comparisons. However, in some situations it may be necessary to assess DIF in more than two groups, for example, in cross-cultural or cross-national studies. In this context, pairwise comparisons may be problematic. Multiple pairwise comparisons can prove to be very time consuming and costly. For example, in South Africa, where there are 11 official language groups, the use of pairwise comparisons would entail 55 separate comparisons. As a second example, Keeves (1992) noted that a total of 24 countries participated in the Second IEA Science Study. This would require an even greater number of comparisons. The process is further complicated because (1) the items flagged as DIF may differ with each comparison, and since all flagged items need to be accounted for, DIF studies in many real-life situations would become unmanageable, and (2) the need to apply the two-stage procedure (Holland & Thayer, 1988) would double the number of comparisons, yet again.

A possible solution would be to assess DIF simultaneously in all the groups, or at least reduce the number of comparisons conducted without excluding or eliminating any group(s) from the analysis. Unlike pairwise comparisons, in simultaneous comparisons the data from all the different groups in the study are always included in all comparisons that are conducted. For example, when comparing 4 groups, the existence of DIF is determined by comparing group 1 to groups 2, 3, and 4 combined. The advantages of this approach are that (1) provided the total sample size is large, analyses can be conducted on many different groups even though some of these groups may consist of relatively small samples, (2) the total number of (multiple) pairwise comparisons is reduced, especially when many groups are compared, and (3) the performance of different individual groups can be directly compared to that of what Ellis and Kimmel (1992) call the "composite group" (which includes the combined responses of all groups in the study), which serves as a point of reference.
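The savings in the number of analyses are easy to quantify, as the short sketch below shows for the two examples mentioned above (the 4-group illustration, the 11 South African language groups, and the 24 IEA countries).

```python
# A minimal sketch of the bookkeeping behind this argument: the number of
# DIF analyses needed for k groups under pairwise comparisons versus
# comparisons of each group with the pooled "composite" of the others.
def pairwise_analyses(k: int) -> int:
    return k * (k - 1) // 2          # every pair of groups compared once

def composite_analyses(k: int) -> int:
    return k                         # each group compared with the rest combined

for k in (4, 11, 24):                # 4-group example, South Africa, Second IEA study
    print(f"{k:2d} groups: {pairwise_analyses(k):3d} pairwise vs. "
          f"{composite_analyses(k):2d} composite comparisons")
```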
Ideally, only a single estimation should be required to detect any DIF, either for or against any of the groups in the study. In this respect, the use of IRT-based procedures seems ideal. Ellis and Kimmel (1992) used IRT procedures to determine "unique cultural response patterns" of English, German, and French subjects regarding their attitudes towards mental health. In their study, Ellis and Kimmel (1992) combined the responses of all subjects into what they called the "omnicultural composite" group, which served as a reference against which the responses of all other groups were compared. Items which differed significantly from the omnicultural composite were indicative of cultural responses that are unique to the group under investigation. With regard to DIF, Ellis and Kimmel (1992) noted that:

in the same way that a DIF item identified in a two-group comparison indicates that an item functions differentially for the two groups, a DIF item identified in a comparison between an individual culture and an omnicultural composite indicates that an item functions differentially for the individual cultures compared to the omnicultural composite (p. 178).

If the approach used by Ellis and Kimmel (1992) is adopted for DIF studies involving multiple groups, all items that are flagged for any specific group could simply be regarded as being in favor of or against that group. Removing (or revising) these DIF items results in equivalent scores, and thus the performance of the groups studied can be directly compared. The point is that the omnicultural reference group, free of any DIF, represents an underlying construct that is (equally) common to all the groups studied, and thus no single group would have any advantage over any other group, for whatever reason, for example, differences in access to resources, different curricula, etc. However, besides the Ellis and Kimmel (1992) study, data regarding the performance of currently available techniques to simultaneously detect DIF in multiple groups are not generally available.

Logistic Regression Procedure

A viable alternative to IRT procedures that addresses some of their disadvantages is to use the logistic regression procedure. Like item response models, logistic regression procedures are also model based and can account for the continuous nature of ability (Swaminathan & Rogers, 1990). In addition, these procedures are able to accommodate small samples, are associated with well-accepted statistical tests of significance, and condition on observed, rather than latent, scores (Bennett, Rock, & Kaplan, 1987; Hills, 1989; Swaminathan & Rogers, 1990). Also, logistic regression procedures are easier to use and understand than IRT-based procedures, and are readily available in standard statistical computer packages (Hills, 1989; Hosmer & Lemeshow, 1989). Compared to the Mantel-Haenszel procedure, the LR procedure has the advantages that non-uniform DIF can be detected, and conditioning can easily be extended to multiple variables.

The LR model is based on the equation

P(u = 1 | X) = e^(β0 + β1X) / [1 + e^(β0 + β1X)]   (1)

where u is the response to the item, X is the observed ability of the individual, β0 is the intercept parameter, and β1 is the slope parameter. The use of the LR model to detect DIF was proposed by Swaminathan and Rogers (1990) as it takes into account the continuous nature of the ability scale, is able to detect uniform and non-uniform DIF, and enables the incorporation of two or more covariates into the equation. The equation

P(uij = 1 | Xij) = e^(β0j + β1jXij) / [1 + e^(β0j + β1jXij)]   (2)

where i = 1, . . . , nj (representing the examinees) and j = 1, 2 (representing the groups), was used by the authors to identify any DIF for the two groups of interest. If β01 = β02 (i.e., the intercepts are equal) and β11 = β12 (i.e., the slopes are equal), the logistic curves of the item for the two groups of interest are the same, and thus the item does not display any DIF. If, however, β01 ≠ β02 and β11 = β12, the curves are parallel but not coincident, and hence uniform DIF may be inferred. On the other hand, if β11 ≠ β12, the curves are not parallel and thus non-uniform DIF exists, irrespective of whether the intercepts (β01, β02) are equal or not.
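A minimal sketch of the logistic regression DIF test implied by Equations 1 and 2 is given below (Python with statsmodels, on simulated, hypothetical data). The item response is regressed on the observed score, the group indicator, and their interaction; likelihood ratio tests on the group and interaction terms distinguish uniform from non-uniform DIF.

```python
# A minimal sketch (simulated, hypothetical data) of the logistic regression
# DIF procedure: nested models with score, group, and a group-by-score
# interaction are compared with likelihood ratio tests.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
group = np.repeat([0, 1], n)                    # 0 = source group, 1 = target group
score = rng.normal(0.0, 1.0, 2 * n)             # observed (standardized) test score
# Simulate an item with uniform DIF: equally discriminating, harder for group 1.
true_logit = 0.5 + 1.0 * score - 0.6 * group
u = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))   # 0/1 item responses

def fit(*columns):
    return sm.Logit(u, sm.add_constant(np.column_stack(columns))).fit(disp=0)

m_score = fit(score)                            # Equation 1: conditioning on score only
m_uniform = fit(score, group)                   # adds a group intercept (uniform DIF)
m_nonuniform = fit(score, group, score * group) # adds the interaction (non-uniform DIF)

p_uniform = chi2.sf(2 * (m_uniform.llf - m_score.llf), df=1)
p_nonuniform = chi2.sf(2 * (m_nonuniform.llf - m_uniform.llf), df=1)
print(f"Uniform DIF test:     p = {p_uniform:.4f}")
print(f"Non-uniform DIF test: p = {p_nonuniform:.4f}")
# A significant group term with a non-significant interaction corresponds to
# unequal intercepts with equal slopes, i.e., uniform DIF as described above.
```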
Conclusions

From the above discussion of the various judgmental and statistical methods of assessing equivalence and identifying DIF, it is evident that a great deal more research and information is still required before the proper application of these procedures is mastered, and the questions that these comparisons raise are adequately addressed.

This is especially true for cross-cultural/language studies. Some of the specific problems related to these applications include the development of procedures and techniques for addressing: (1) cross-cultural comparisons involving multiple language groups, (2) small or modest group sample sizes for DIF and equivalence studies, (3) comparisons between groups with vastly different ability distributions, (4) issues of multidimensionality in the data, (5) the use of tests from which polytomous data are derived (that is, item/question formats that include a mixture of objective and subjective items/questions), and (6) the development of international standards and guidelines for conducting cross-cultural/language comparisons, that is, the translation, adaptation, administration, and interpretation of tests, as well as the reporting and utilization of scores (Hambleton, 1993). For excellent discussions of the methodological issues and methods associated with adapting tests, readers are referred to Brislin (1970, 1986), Camilli and Shepard (1994), Poortinga and van de Vijver (1987, 1991), and van de Vijver and Poortinga (1991a). For outstanding examples of test adaptation projects, see Angoff and Cook (1988) and Woodcock and Muñoz-Sandoval (1993).

Finally, as noted by van de Vijver and Poortinga (in press), and as is clear from the International Test Commission test adaptation guidelines (Hambleton, 1994), regardless of the care with which item bias (or DIF) studies are carried out, there are two other prominent sources of bias which must always be attended to as well. Construct bias has to do with the extent to which the construct measured by a psychological instrument is equivalent in the two or more language and cultural groups where the instrument is intended for use. Trait bias has to do with the extent to which the instrument itself is free of factors which serve to invalidate the scores for one of the intended groups. Lack of test familiarity or speededness are two factors which could reduce the validity of a psychological instrument. In practice then, DIF studies will never be sufficient for establishing the utility of an adapted instrument in a new language and/or culture. But they can be quite valuable in identifying one big source of error in adapted tests.

Author's Address:

Prof. Ronald K. Hambleton
University of Massachusetts
Hills South, Room 152
Amherst, MA 01002
USA

References

Angoff, W. H., & Cook, L. L. (1988). Equating the scores of the Prueba de Aptitud Academica and the Scholastic Aptitude Test (Report No. 88–2). New York, NY: College Entrance Examination Board.

Bennett, R. E., Rock, D. A., & Kaplan, B. A. (1987). SAT differential item performance for nine handicapped groups. Journal of Educational Measurement, 24, 41–55.

Bollwark, J. (1991). Evaluation of IRT anchor test designs in test translation studies. Unpublished doctoral dissertation, University of Massachusetts at Amherst.

Bracken, B. A., & Barona, A. (1991). State of the art procedures for translating, validating and using psychoeducational tests in cross-cultural assessment. School Psychology International, 12, 119–132.

Brislin, R. W. (1970). Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1, 185–216.

Brislin, R. W. (Ed.). (1976). Translation: Application and research. New York: John Wiley.

Brislin, R. W. (1986). The wording and translation of research instruments. In W. J. Lonner & J. W. Berry (Eds.), Field methods in cross-cultural psychology (pp. 137–164). Newbury Park, CA: Sage Publishers.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased items. Newbury Park, CA: Sage Publishers.

Candell, G. L., & Hulin, C. L. (1986). Cross-language and cross-cultural comparisons in scale translations: Independent sources of information about item nonequivalence. Journal of Cross-Cultural Psychology, 17, 417–440.

Dorans, N. J. (1989). Two new approaches to assessing differential item functioning: Standardization and the Mantel-Haenszel method. Applied Psychological Measurement, 3, 217–233.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Erlbaum.

Drasgow, F., & Hulin, C. L. (1986). Assessing the equivalence of measurement of attitudes and aptitudes across heterogeneous subpopulations (Unpublished manuscript). Urbana-Champaign, IL: University of Illinois.

Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70, 662–680.

Ellis, B. B. (1989). Differential item functioning: Implications for test translation. Journal of Applied Psychology, 74, 912–921.

Ellis, B. B. (1991). Item response theory: A tool for assessing the equivalence of translated tests. Bulletin of the International Test Commission, 18, 33–51.

Ellis, B. B., & Kimmel, H. D. (1992). Identification of unique cultural response patterns by means of item response theory. Journal of Applied Psychology, 77, 177–184.

Hambleton, R. K. (1993). Translating achievement tests for use in cross-national studies. European Journal of Psychological Assessment, 9, 54–65.

Hambleton, R. K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10, 229–244.

Hambleton, R. K., & Bollwark, J. (1991). Adapting tests for use in different cultures: Technical issues and methods. Bulletin of the International Test Commission, 18, 3–32.

Hambleton, R. K., Clauser, B. E., Mazor, K. M., & Jones, R. W. (1993). Advances in detection of differentially functioning test items. European Journal of Psychological Assessment, 9, 1–18.

Hambleton, R. K., & Kanjee, A. (1994). Enhancing the validity of cross-cultural studies: Improvements in instrument translation methods. In T. Husen & T. N. Postlewaite (Eds.), International encyclopedia of education (2nd ed.). Oxford, UK: Pergamon Press.

Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2, 313–334.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publishers.

Hills, J. R. (1989). Screening for potentially biased items in testing programs. Educational Measurement: Issues and Practice, 8, 5–11.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Erlbaum.

Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.

Hulin, C. L. (1987). A psychometric theory of evaluations of item and scale translations: Fidelity across languages. Journal of Cross-Cultural Psychology, 18, 115–142.

Hulin, C. L., & Mayer, L. J. (1986). Psychometric equivalence of a translation of the Job Descriptive Index into Hebrew. Journal of Applied Psychology, 71, 83–94.

Keeves, J. P. (1992, April). Technical issues in the first and second IEA science studies. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Knol, D. L., & Berger, M. P. F. (1991). Empirical comparison between factor analysis and multidimensional item response models. Multivariate Behavioral Research, 26, 457–477.

Linn, R. L., & Harnisch, D. L. (1981). Interaction between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109–118.

Lonner, W. J. (1990). An overview of cross-cultural testing and assessment. In R. W. Brislin (Ed.), Applied cross-cultural psychology (pp. 56–76). Newbury Park, CA: Sage Publications.

Mayberry, P. W. (1984, April). Analysis of cross-cultural attitudinal scale translation using maximum likelihood factor analysis. Paper presented at the meeting of the American Educational Research Association, New Orleans, LA.

Mazor, K., Kanjee, A., & Clauser, B. (in press). Using logistic regression with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement.

Poortinga, Y. H. (1983). Psychometric approaches to intergroup comparison: The problem of equivalence. In S. H. Irvine & J. W. Berry (Eds.), Human assessment and cross-cultural factors (pp. 237–258). New York: Plenum.

Poortinga, Y. H., & Malpass, R. S. (1986). Making inferences from cross-cultural data. In W. J. Lonner & J. W. Berry (Eds.), Field methods in cross-cultural psychology (pp. 17–46). Beverly Hills, CA: Sage.

Poortinga, Y. H., & van de Vijver, F. J. R. (1987). Explaining cross-cultural differences: Bias analysis and beyond. Journal of Cross-Cultural Psychology, 18, 259–282.

Poortinga, Y. H., & van de Vijver, F. J. R. (1991). Culture-free measurement in the history of cross-cultural psychology. Bulletin of the International Test Commission, 18, 72–87.

Prieto, A. J. (1992). A method for translation of instruments to other languages. Adult Education Quarterly, 43, 1–14.

Rogers, H. J., & Swaminathan, H. (1994, April). Logistic regression procedures for detecting DIF in nondichotomous item responses. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans.

Royce, J. R. (1988). The factor model as a theoretical basis for individual differences. In S. H. Irvine & J. W. Berry (Eds.), Human abilities in cultural context (pp. 147–165). New York: Cambridge University Press.

Scheuneman, J. D., & Bleistein, C. A. (1989). A consumer's guide to statistics for identifying differential item functioning. Applied Measurement in Education, 2, 255–275.

Schmitt, A. P., Holland, P. W., & Dorans, N. J. (1993). Evaluating hypotheses about differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 281–315). Hillsdale, NJ: Erlbaum.

Shepard, L. A., Camilli, G., & Williams, D. M. (1985). Validity of approximation techniques for detecting item bias. Journal of Educational Measurement, 22, 49–58.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.

Triandis, H. C. (1976). Approaches toward minimizing translation. In R. W. Brislin (Ed.), Translation: Application and research (pp. 228–243). New York: John Wiley.

van de Vijver, F. J. R., & Poortinga, Y. H. (1991a). Testing across cultures. In R. K. Hambleton & J. Zaal (Eds.), Advances in educational and psychological testing (pp. 277–307). Boston: Kluwer Academic Publishers.

van de Vijver, F. J. R., & Poortinga, Y. H. (1991b). Culture-free measurement in the history of cross-cultural psychology. Bulletin of the International Test Commission, 18, 72–87.

van de Vijver, F. J. R., & Poortinga, Y. H. (in press). Towards an integrated analysis of bias in cross-cultural assessment. European Journal of Psychological Assessment.

Woodcock, R. W., & Muñoz-Sandoval, A. F. (1993). An IRT approach to cross-language test equating and interpretation. European Journal of Psychological Assessment, 9, 233–241.

Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–348). Hillsdale, NJ: Erlbaum.
