

Article
Journal of Psychoeducational Assessment
1–14
© The Author(s) 2019
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/0734282919862144
https://doi.org/10.1177/0734282919862144
journals.sagepub.com/home/jpa

A Demonstration of Mokken Scale Analysis Methods Applied to Cognitive Test Validation Using the Egyptian WAIS-IV

Gomaa S. M. Abdelhamid1,2, Juana Gómez-Benito1, Ahmed T. M. Abdeltawwab2, Mostafa H. S. Abu Bakr2,
and Amina M. Kazem3

Abstract
The fourth edition of the Wechsler Adult Intelligence Scale (WAIS-IV) has been used
extensively for assessing adult intelligence. This study uses Mokken scale analysis to investigate
the psychometric properties of WAIS-IV subtests adapted for the Egyptian population in a
sample of 250 adults between 18 and 25 years of age. The monotone homogeneity model and
the double monotonicity model were consistent with the subtest data. The items of all subtests
except Matrix Reasoning, Information, Similarities, and Vocabulary formed a unidimensional
scale. The WAIS-IV subtests have discriminatory and invariantly ordered items, although some
items violated the invariant item ordering and scalability criteria. Therefore, the WAIS-IV
subtests—with the exception of some items—are hierarchical scales that allow items to be
ordered according to difficulty and subjects to be ordered using the sum score. In conclusion,
the current study provides evidence of the dimensionality and hierarchy of the WAIS-IV
subtests in the framework of Mokken scaling, although care should be taken when interpreting
or including certain items.

Keywords
Mokken, WAIS-IV, item response theory, intelligence

1University of Barcelona, Spain
2Fayoum University, Egypt
3Ain Shams University, Cairo, Egypt

Corresponding Author:
Juana Gómez-Benito, University of Barcelona, Spain.
Email: juanagomez@ub.edu

The Wechsler Adult Intelligence Scale (WAIS) is one of the most widely used scales for assessing
the cognitive abilities of adults and older adolescents (Benson, Hulac, & Kranzler, 2010;
Salthouse & Saklofske, 2010). The most recent version of the WAIS (Wechsler Adult Intelligence
Scale–Fourth Edition [WAIS-IV]; Wechsler, 2008) updates the base theory of the intelligence
construct, assuming it to be a general concept comprising four indices: Verbal Comprehension
(VC), Perceptual Reasoning (PR), Working Memory (WM), and Processing Speed (PS; Saklofske
et al., 2012). The WAIS-IV takes into consideration current concepts of fluid reasoning, WM, and
PS from the Cattell–Horn–Carroll model and contains 15 subtests, five of which are supplemental
(Bowden, Saklofske, & Weiss, 2011a; Kaufman, Salthouse, Scheiber, & Chen, 2016). For
each subtest, it is assumed that items are administered in ascending order of difficulty;
consequently, the start points, reversal rules, basal rules, and discontinue rules are used to reduce the
administration time and estimate participants’ sum scores without having to apply all the test
items (Climie & Rostad, 2011; Weiss, Saklofske, Coalson, & Raiford, 2010).
Several recent studies (Abdelhamid, Gómez-Benito, Abdeltawwab, Abu Bakr, & Kazem, 2019;
Bowden, Saklofske, & Weiss, 2011b; Miller, Davidson, Schindler, & Messier, 2013; Nelson,
Canivez, & Watkins, 2013) have explored the validity of the WAIS-IV using statistical tools
based on classical test theory, such as internal consistency reliability, factor analysis, and
confirmatory factor analysis. Together, this body of studies
provides important insights into the statistical properties of the WAIS-IV, but their conclusions
depend on the total score for the subtests in the framework of classical test theory, which has
many acknowledged limitations. One such limitation is that the item parameter is dependent on
the sample properties and the person parameter is dependent on the specific selection of items
in a test (Embretson, 1996).
Sijtsma, Emons, Bouwmeester, Nyklíček, and Roorda (2008) argued that an instrument should
meet two requirements to be an efficient measure. First, the true number of dimensions measured
must be clear (i.e., whether the instrument is unidimensional or multidimensional). In light of
this requirement, if each subtest of the WAIS-IV assesses one dimension, the sum score of subtest
items can be calculated to determine an adult's level of the latent trait being measured. However,
if the subtest encompasses two or more dimensions, it is necessary to estimate a sum score for each
dimension that reflects a feature of the latent trait measured. The second requirement is that the
psychometric properties of the items must be accurately estimated, as they are necessary in ensuring
that the difficulty and discriminatory power are accurate.
In the literature, there is no discussion of the properties of WAIS-IV subtests adapted for
Arabic speakers (Abdelhamid et al., 2019) in terms of individual item scores or, in particular, the
dimensionality of each subtest. To our knowledge, no study has focused on item statistical
properties or dimensionality in each of the WAIS-IV subtests using modern psychometric theory, such
as item response theory (IRT) or Mokken scale analysis (MSA); addressing this gap is the main aim
of the current study.
MSA is a nonparametric procedure that provides a series of methods for examining the rela-
tionship between items and the latent traits being measured and for investigating hierarchies of
items in measures (Watson et al., 2012). It has some advantages over parametric procedures.
First, MSA is less restrictive about the data with regard to the item response function (IRF)
than parametric IRT models, which assume a specific (e.g., S-shaped logistic) form for the IRF
(Sijtsma & Van der Ark, 2017). This
helps researchers to retain items that would otherwise be omitted from a measure in restrictive
parametric IRT models. Second, MSA provides a set of exploratory tools for dimensionality
analysis, which is not possible in parametric IRT models (Emons, Sijtsma, & Pedersen, 2012).

Nonparametric Item Response Theory (NIRT) Models


MSA uses a set of methods that assesses the fit of two NIRT models. These NIRT models are
known as the monotone homogeneity model (MHM) and its special case, the double monotonicity
model (DMM). The MHM and DMM share a number of assumptions, including unidimensionality,
monotonicity, and local independence, whereas the DMM adds the assumption of nonintersecting
IRFs (Sijtsma & Van der Ark, 2017).
The MHM indicates that each item exhibits a monotonic and positive relationship with the
latent variable (Emons et al., 2012). The MHM uses the sum score X+ to order individuals
according to their abilities; for a set of items with monotone homogeneity, the order of
individuals on the latent variable is the same for any of the MHM items (Sijtsma & Molenaar, 2002).
For a group of items satisfying the DMM, the order of the items according to difficulty
(i.e., mean score) is, in addition, the same for all subjects (Mokken, 1997). These models were extended for
polytomous items by Molenaar (1997), who proposed the polytomous MHM and the polytomous
DMM; a DMM was also proposed by Sijtsma, Debets, and Molenaar (1990).
MSA uses three scalability coefficients: the item scalability coefficient (Hi), the item-pair
scalability coefficient (Hij), and the scale total scalability coefficient (H). The item-pair
scalability coefficient (Hij) is defined as the ratio of the covariance between any pair of items
i and j and their maximum possible covariance given the marginal distributions of the two item
scores (Mooij, 2012), reflecting the internal consistency of each pair of items. The item
scalability coefficient Hi is expressed as the ratio of the sum of all pairwise covariances with
regard to any item i and the sum of all pairwise maximum covariances of this item i, summarizing
the accuracy of item discrimination and the strength of the relationship between the item and the
whole set of items (e.g., the trait scale; Emons et al., 2012). Higher values of Hi indicate higher
discriminatory power. The scale total scalability coefficient H is the ratio of the sum of all
pairwise covariances and the sum of all pairwise maximum covariances (Mooij, 2012), reflecting
the relationship between the sum score and trait scale. Higher values of H indicate that the sum
score can be used to order individuals with high accuracy. A set of items is considered to be a
Mokken scale when all values of the item-pair scalability coefficient are positive and all item
scalability coefficients are greater than 0.30 (Watson et al., 2012).
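
As an illustration of these definitions, the following Python sketch computes Hij, Hi, and H for dichotomous (0/1) item scores. It is not the implementation used in the analysis reported here; the function names and the example data are hypothetical, and the sketch assumes that, for two dichotomous items with proportions correct pi and pj, the maximum covariance given the marginals is min(pi, pj) − pi·pj.

```python
from itertools import combinations


def pair_covariances(data, i, j):
    """Return (observed covariance, maximum covariance given the marginals)
    for dichotomous items i and j; data is a list of 0/1 score rows."""
    n = len(data)
    p_i = sum(row[i] for row in data) / n
    p_j = sum(row[j] for row in data) / n
    p_ij = sum(row[i] * row[j] for row in data) / n
    cov = p_ij - p_i * p_j
    cov_max = min(p_i, p_j) - p_i * p_j
    return cov, cov_max


def scalability(data):
    """Compute the item-pair (Hij), item (Hi), and scale (H) coefficients."""
    k = len(data[0])
    pairs = list(combinations(range(k), 2))
    cov, cov_max = {}, {}
    for i, j in pairs:
        c, m = pair_covariances(data, i, j)
        if m <= 0:
            raise ValueError("an item without variance cannot be scaled")
        cov[i, j] = cov[j, i] = c
        cov_max[i, j] = cov_max[j, i] = m
    h_ij = {(i, j): cov[i, j] / cov_max[i, j] for i, j in pairs}
    h_i = [
        sum(cov[i, j] for j in range(k) if j != i)
        / sum(cov_max[i, j] for j in range(k) if j != i)
        for i in range(k)
    ]
    h = sum(cov[p] for p in pairs) / sum(cov_max[p] for p in pairs)
    return h_ij, h_i, h


def is_mokken_scale(h_ij, h_i, lower_bound=0.30):
    """All item-pair coefficients positive and all item coefficients at or
    above the lower bound (0.30 here, an adjustable assumption)."""
    return all(v > 0 for v in h_ij.values()) and all(v >= lower_bound for v in h_i)


# Hypothetical data: a perfect Guttman pattern (no ordering errors).
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h_ij, h_i, h = scalability(guttman)
print(h_ij, h_i, h, is_mokken_scale(h_ij, h_i))
```

For polytomous items the maximum covariance has to be derived from the full marginal score distributions, which this sketch does not attempt.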

Assumptions of NIRT Models


The unidimensionality assumption indicates that all items measure the same latent variable
(denoted by θ; Straat, Van der Ark, & Sijtsma, 2013). MSA proposes an automated item selection
procedure (AISP) to select many items that measure the same trait (Mokken, 1971). Straat et al.
(2013) described an item selection procedure based on an objective function using a genetic
algorithm (GA), which examines all the possible partitions of the item pool, and reports the parti-
tion that best represents Mokken’s objective (i.e., to select a set of sufficiently discriminating
items in each cluster). The second assumption is local independence, according to which, conditional
on the latent trait θ, a person’s response to any item i is independent of his or her responses to
any other item j (Sijtsma & Molenaar, 2002); for example, a person’s response to one item is not
affected by the score on another item. Mokken (1997) showed that sampling independence or
independence of responses between individuals reflects that item parameter estimation is
independent of the sample used. The third assumption is monotonicity of the IRF, which means that
IRFs are monotone nondecreasing functions of the latent trait θ (Mokken, 1971). The IRF depicts the
relationship between the probability of an individual's correct response on item Xj and the latent
trait level (i.e., a higher latent trait level corresponds to a higher expected item score).
Finally, the nonintersection
assumption indicates that IRFs do not intersect. It includes invariant item ordering (IIO) for
dichotomous data (Sijtsma, Meijer, & Van der Ark, 2011); when the IIO assumption is satisfied
for a set of items, these items form a hierarchical scale from easiest to most difficult.
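
The monotonicity assumption can be inspected on observed data by replacing the latent trait with the rest score (the sum score on all other items). The Python sketch below is a simplified illustration under stated assumptions, not the procedure used in this study: it does not merge small rest-score groups or test violations for significance, and the function names, the 0.03 tolerance, and the example data are hypothetical.

```python
def rest_score_proportions(data, item):
    """Group respondents by rest score (sum of all other items) and return
    a sorted list of (rest score, proportion correct on item, group size)."""
    groups = {}
    for row in data:
        rest = sum(row) - row[item]
        groups.setdefault(rest, []).append(row[item])
    return [(r, sum(v) / len(v), len(v)) for r, v in sorted(groups.items())]


def monotonicity_violations(data, item, min_decrease=0.03):
    """List pairs of rest-score groups in which the higher group has a lower
    proportion correct by more than min_decrease (illustrative tolerance)."""
    props = rest_score_proportions(data, item)
    violations = []
    for a in range(len(props)):
        for b in range(a + 1, len(props)):
            drop = props[a][1] - props[b][1]
            if drop > min_decrease:
                violations.append((props[a][0], props[b][0], drop))
    return violations


# Hypothetical 0/1 data: the first set is consistent with monotonicity,
# the second makes item 0 easier for respondents with a lower rest score.
consistent = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
reversed_item = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]]
print(monotonicity_violations(consistent, 0))
print(monotonicity_violations(reversed_item, 0))
```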
Numerous studies have used MSA to examine the psychometric properties of various tests,
but despite the advantages of the MHM and its special case, the DMM, described above (e.g.,
Sijtsma & Van der Ark, 2017), to our knowledge, there is no published calibration of the
WAIS-IV scale using MSA. As far as we are aware, the current study is also the first
to examine IIO and dimensionality of WAIS-IV subtests adapted for an Egyptian sample
(Abdelhamid et al., 2019) using these two NIRT models. This study was, therefore, an
opportunity to advance the understanding of Mokken analysis and to assess the fit of
IRT models for the WAIS-IV subtests analyzed. Our study has three purposes: (a) to
evaluate the dimensionality of each of the WAIS-IV subtests using MSA, (b) to estimate
whether the hierarchy of the items in each of the WAIS-IV subtests could be established with
IIO, and (c) to check the ability of the items in each of the subtests analyzed to identify
individual differences on the latent trait measured.

Method
Participants
Two hundred fifty normal adults agreed to participate voluntarily in this study. Once informed
consent had been received, the participants were tested from 2015 to 2016 across Egypt.
Participants were aged between 18 and 24 years, with an overall mean age of 20.6 years (SD =
1.7 years), and more than half the sample (62.7%) was female. All participants were native
speakers of Egyptian Arabic. All participants were evaluated individually by psychologists and
educators who had received prior training in the application of the scale, based on guidelines
explained in the administration manual (Wechsler, 2008).
The research reported in this study was part of a project to adapt the WAIS-IV to Arabic speakers,
and permission was obtained from the ethics committee of Fayoum University.

Measures
The WAIS-IV Arabic version was used (Abdelhamid et al., 2019), which, like the English version,
has 10 core subtests and five supplemental subtests that generate a score for four indices:
VC, PR, WM, and PS. Some of these subtests are scored 0, 1 (i.e., dichotomous data). For PR,
the subtests analyzed were Visual Puzzles (26 items), Figure Weights (27 items), Picture
Completion (24 items), and Matrix Reasoning (26 items); for WM, Arithmetic (22 items); for
VC, Information (26 items); for PS, Symbol Search (60 items) and Coding (135 items). Others
are scored polytomously (i.e., 0, 1, 2, etc.): for VC, the subtests Similarities (18 items) and
Vocabulary (30 items); for PR, Block Design (14 items); for WM, Digit Span, which has three
subscales (Digit Span-Forward [eight items], Digit Span-Backward [eight items], and Digit
Span-Sequencing [eight items]), and Letter–Number Sequencing (10 items); for PS, Cancellation
(two items).
The WAIS-IV subtests can also be classified as verbal and nonverbal. The verbal tests, such
as Similarities, Vocabulary, Arithmetic and Information, contain some changes to ensure that
they are suitable for the Arabic-speaking population. By contrast, nonverbal tests such as Visual
Puzzles, Figure Weights, Picture Completion, Matrix Reasoning, Block Design, and Symbol
Search were unchanged. In the Letter–Number Sequencing subscale, the English alphabet and
numbers were converted to Arabic script, taking into account the order of letters and numbers
during item adaptation. For the Digit Span-Forward, Digit Span-Backward, and Digit Span-
Sequencing subscales, which only contain numbers, the numerals were converted to the Arabic
equivalents. It should be noted that the WAIS-IV subtests contain a set of easy items that do not
fit the sample used in this study and are, therefore, omitted from the analysis.

Data Analysis
The R package mokken (Version 2.8.2; Van der Ark, 2016) was used to analyze the data for WAIS-IV
subtests. The Mokken scalability coefficients Hij, Hj, and H were examined following the criteria
proposed by Mokken (1971): A scale is considered weak if .30 ≤ H < .40, medium if
.40 ≤ H < .50, and strong if H ≥ .50; the values of Hij must be greater than zero; and,
finally, if the coefficient Hj < .30, item j should be reviewed or deleted, but if Hj ≥ .30, item
j should be selected to make up the Mokken scale. In addition to the above coefficients, the
dimensionality of each subtest was examined with the genetic algorithm, using increasing values of
the lower bound c (Straat et al., 2013); the lower bound c indicates the minimum value of
discrimination (Hj) for items that make up the Mokken scale. We started with an initial value
c = .0, which was increased in increments of .05 up to a value of .55, as recommended by Sijtsma
and Molenaar (2002). The scale is unidimensional when all items are selected in one scale for a
lower bound of c ≤ .3; when the values of c increase, these items must not be selected to form the
scale. The IIO assumption was investigated using the manifest IIO method, by counting the number
of violations, and the backward item selection method, which removes items in violation of IIO.
In addition, we considered the H-transpose index (HT), which estimates the distance between IRFs;
the greater the distance between IRFs, the more accurate the item ordering.
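
The lower-bound grid and the unidimensionality rule described above can be sketched as follows. This Python fragment is only an illustration under stated assumptions: it does not perform item selection itself, and the function names and the encoding of a partition (one scale label per item, with 0 meaning "not selected") are hypothetical conventions introduced here.

```python
def lower_bounds(start=0.0, stop=0.55, step=0.05):
    """Grid of lower bounds c from start to stop in equal increments."""
    n = int(round((stop - start) / step))
    return [round(start + k * step, 2) for k in range(n + 1)]


def number_of_scales(assignment):
    """Count distinct scales in a partition; label 0 marks unselected items."""
    return len({label for label in assignment if label != 0})


def is_unidimensional(partitions, cutoff=0.30):
    """True when, for every lower bound c <= cutoff, all items fall into a
    single scale (label 1) and no item is left unselected."""
    return all(set(a) == {1} for c, a in partitions.items() if c <= cutoff)


# Hypothetical partitions of a three-item subtest at selected lower bounds.
partitions = {0.0: [1, 1, 1], 0.3: [1, 1, 1], 0.4: [1, 1, 0]}
print(lower_bounds())
print(is_unidimensional(partitions), number_of_scales(partitions[0.4]))
```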
Thus, the statistic HT was reported as an indicator of the accuracy of the IIO on the basis of
the following criteria: HT < .30 signifies that the item ordering is inaccurate, .30 ≤ HT < .40
signifies low accuracy, .40 ≤ HT < .50 signifies medium accuracy, and HT ≥ .50 signifies
high accuracy (Ligtvoet, Van der Ark, te Marvelde, & Sijtsma, 2010). Finally, the Crit value
proposed by Sijtsma and Molenaar (2002) to check the effect size of a significant violation was
used with the following criteria: if Crit < 40, then violation is minor; if 40 ≤ Crit < 80, violation
is nonserious but must be reviewed by the researcher; and if Crit ≥ 80, violation is serious (Van
Schuur, 2011).
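
The cutoffs used in this section can be collected in a small lookup, shown here as a minimal Python sketch. The function names and labels are hypothetical; the thresholds are those stated above for H, HT, and Crit. The text does not name a label for H < .30, so the sketch returns a neutral placeholder in that case.

```python
def classify_h(h):
    """Scale strength from the total scalability coefficient H."""
    if h >= 0.50:
        return "strong"
    if h >= 0.40:
        return "medium"
    if h >= 0.30:
        return "weak"
    return "below weak threshold"  # placeholder label, not from the text


def classify_ht(ht):
    """Accuracy of the item ordering from the HT coefficient."""
    if ht >= 0.50:
        return "high"
    if ht >= 0.40:
        return "medium"
    if ht >= 0.30:
        return "low"
    return "inaccurate"


def classify_crit(crit):
    """Seriousness of a violation from its Crit value."""
    if crit >= 80:
        return "serious"
    if crit >= 40:
        return "nonserious, review"
    return "minor"


# Example values taken from Table 2 (Visual Puzzles H, Vocabulary HT).
print(classify_h(0.48), classify_ht(0.38), classify_crit(90))
```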
To assess the reliability of the scale, we estimated three reliability coefficients for each subtest
of the WAIS-IV: the lambda-2 statistic (Sijtsma, 2009), as an alternative to Cronbach’s alpha; the
Molenaar–Sijtsma statistic, as a reliability estimator with a smaller bias for MSA (MS; Sijtsma
& Molenaar, 2002); and the latent class reliability coefficient (LCRC), an unbiased statistic of
test-score reliability (Van der Ark, Van der Palm, & Sijtsma, 2011).
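
For readers unfamiliar with lambda-2, the Python sketch below computes it alongside Cronbach's alpha from an item covariance matrix, using Guttman's standard formula. It is an illustration only, with hypothetical function names and example data; the MS and LCRC coefficients require additional modeling steps and are not sketched here.

```python
from math import sqrt


def covariance_matrix(data):
    """Item covariance matrix (divisor n) for a list of score rows."""
    n, k = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(k)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / n
         for j in range(k)]
        for i in range(k)
    ]


def alpha_and_lambda2(data):
    """Return (Cronbach's alpha, Guttman's lambda-2) for the sum score."""
    cov = covariance_matrix(data)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)  # variance of the sum score
    off = [cov[i][j] for i in range(k) for j in range(k) if i != j]
    alpha = k / (k - 1) * sum(off) / total_var
    lambda2 = (sum(off) + sqrt(k / (k - 1) * sum(c * c for c in off))) / total_var
    return alpha, lambda2


# Hypothetical 0/1 data for three items and four respondents.
scores = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(alpha_and_lambda2(scores))
```

Lambda-2 is never smaller than alpha for the same data, which is why it is often preferred as a lower bound to reliability.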

Results
Descriptive Statistics and Reliability
The majority of participants were students who were not employed and were not married; only
around 22% were employed and married. With regard to educational attainment, the participants
were at different stages of undergraduate study (first year, 22%; second year, 34.8%; third year,
9.2%; fourth year, 8.8%) or held an undergraduate or postgraduate degree (25.2%).
As shown in Table 1, the normality assumption held for the WAIS-IV subtests; skewness and
kurtosis values were within the rule of thumb. Table 1 also provides the results obtained with the
three reliability coefficients for the WAIS-IV subtests. Overall, the subscales were shown to have
good internal consistency: The values of the three coefficients ranged from .68 to .99.

Table 1. Descriptive Statistics and Internal Consistency for the Wechsler Adult Intelligence Scale-IV
Subtests.

                                                                        Reliability
Scale (factor)   Number of items   M range        Skewness   Kurtosis   MS    λ2    LCRC
VP (PR)          26                0.005-0.995    0.35       −0.780     .86   .86   .91
FW (PR)          27                0.085-0.979    −0.052     −0.451     .89   .88   .93
PC (PR)          24                0.007-0.993    0.485      −0.103     .86   .85   .92
MR (PR)          26                0.068-0.996    −0.221     −0.221     .86   .85   .89
AR (WM)          22                0.047-0.995    0.328      −0.728     .86   .86   .91
INF (VC)         26                0.005-0.995    0.330      −0.145     .81   .82   .89
SIM (VC)         18                0.09-1.94      0.60       0.259      .73   .73   .80
V (VC)           30                0.21-1.84      −1.008     0.076      .88   .88   .91
BD (PR)          14                0.20-3.40      1.24       1.31       .69   .75   .82
DS (WM)          24                0.07-1.79      0.363      −0.566     .84   .84   .88
LN (WM)          10                0.04-2.99      0.682      0.409      .72   .68   .76
SS (PS)          60                0.01-0.99      0.424      1.20       .94   .92   .96
CD (PS)          135               0.004-0.995    −0.189     −0.211     .98   .96   .99

Note. MS = Molenaar–Sijtsma; λ2 = lambda-2; LCRC = latent class reliability coefficient; VP = Visual Puzzles;
PR = Perceptual Reasoning; FW = Figure Weights; PC = Picture Completion; MR = Matrix Reasoning;
AR = Arithmetic; WM = Working Memory; INF = Information; VC = Verbal Comprehension; SIM = Similarities;
V = Vocabulary; BD = Block Design; DS = Digit Span; LN = Letter–Number Sequencing; SS = Symbol Search;
PS = Processing Speed; CD = Coding.

MHM Analysis
Tables 2 and 3 provide an overview of Mokken analysis for the WAIS-IV subtests. It should first
be noted that item-pair scalability (Hij) and item scalability (Hj) were positive for all subtest
items. However, Hj > .3 was only achieved for all items of the subtests Visual Puzzles, Arithmetic,
and Block Design, for which total scalability (H) was .48, .55, and .59, respectively, indicating
medium and strong scales. In the case of the other subtests, some items failed to achieve the
Hj > .3 criterion: one item each in Figure Weights and Digit Span; three items each in Picture
Completion, Information, and Letter–Number Sequencing; five items in Matrix Reasoning; six items in
Similarities; 11 items in Vocabulary; and 19 items in Symbol Search. The total scalability (H) of
these subtests was between .30 and .65 when no items were deleted (see Table 2). These results
indicate that Similarities, Vocabulary, and Symbol Search are weak; Picture Completion, Matrix
Reasoning, Information, Letter–Number Sequencing, and Digit Span are medium; and Figure
Weights and Coding are strong. However, when those items that failed to satisfy the Hj criterion
were deleted, the total scalability H was greater than or equal to .5 (strong scales) for all subtests
except Matrix Reasoning, which was medium. These increases in total scalability show a trend
toward greater unidimensionality of subtests.
Additional information about dimensionality can also be found in Table 2, particularly in
the last two columns; we report only the results for c ≤ .30 and c = .40 (the other values did
not show interesting results), which show the results using the genetic algorithm for each
WAIS-IV subtest. In general terms, we found that the items of the subtests Visual Puzzles,
Figure Weights, Arithmetic, Block Design, Digit Span, and Letter–Number Sequencing were
selected to form a scale in each subtest with lower bounds in the range .0 ≤ c ≤ .3. For the
subtests Picture Completion, Matrix Reasoning, and Similarities, with lower bounds in the
range .0 ≤ c ≤ .3, not all the items were selected for the same scale, suggesting that the items
can be divided between two scales for each subtest. Finally, for the subtests Information and
Vocabulary, using the same criterion (.0 ≤ c ≤ .3), the items can form up to four and three
scales, respectively.
Using a slightly more restrictive lower bound criterion of c = .4, the items of the subtests
Visual Puzzles, Figure Weights, Picture Completion, Arithmetic, Block Design, and Letter–
Number Sequencing formed a single scale in each case, whereas the items of the subtests Matrix
Reasoning, Similarities, and Digit Span formed two scales and those of Information and
Vocabulary formed three scales. In summary, the results fitted the expected pattern of a unidi-
mensional scale for Visual Puzzles, Figure Weights, Picture Completion, Arithmetic, Block
Design, Digit Span, and Letter–Number Sequencing, as described by Sijtsma and Molenaar
(2002), whereas the Matrix Reasoning, Information, Similarities, and Vocabulary subtests were
multidimensional.

Table 2. Summary of the Mokken Scaling Analysis Results for the Wechsler Adult Intelligence Scale-IV
Subtests.

                         Total items violated                                        Genetic algorithm:
                                                                                     number of scales
Scale                    Scalability                  IIO
(factor)    H     HT     (Hi < .30)    Monotonicity   Total   Deleted by BIS           c ≤ .30   c = .40
VP (PR)     .48   .84    0             0              0       0                        1         1
FW (PR)     .51   .78    1             0              0       0                        1         1
PC (PR)     .45   .81    3             0              6       Two items (8 and 10)     2         1
MR (PR)     .40   .60    5             0              0       0                        2         2
AR (WM)     .55   .84    0             0              0       0                        1         1
INF (VC)    .43   .88    3             0              7       Two items (5 and 8)      4         3
SIM (VC)    .30   .65    6             0              2       One item (11)            2         2
V (VC)      .32   .38    11            0              12      Three items (9, 10, 21)  3         3
BD (PR)     .59   .75    0             0              0       0                        1         1
DS (WM)     .40   .71    1             0              0       0                        1         2
LN (WM)     .42   .96    3             0              0       0                        1         1
SS (PS)     .38   .86    19            0              —       —                        —         —
CD (PS)     .65   .96    —             —              —       —                        —         —

Note. H = scale scalability; HT = coefficient showing accuracy of item ordering; IIO = invariant item ordering;
BIS = backward item selection; VP = Visual Puzzles; PR = Perceptual Reasoning; FW = Figure Weights;
PC = Picture Completion; MR = Matrix Reasoning; AR = Arithmetic; WM = Working Memory; INF = Information;
VC = Verbal Comprehension; SIM = Similarities; V = Vocabulary; BD = Block Design; DS = Digit Span;
LN = Letter–Number Sequencing; SS = Symbol Search; PS = Processing Speed; CD = Coding.

No significant violations of the monotonicity assumption were detected for the items of each
subtest, but one item of the Matrix Reasoning and Vocabulary subtests had a Crit value in the
range 40 ≤ Crit < 80, showing a nonserious degree of misfit. In addition, strong evidence of
monotonicity was found when inspecting all IRFs of the subtests for all items across the range of
ability. In summary, the results under the MHM indicate that the monotonicity assumption held
for each of the WAIS-IV subtests and that unidimensionality was achieved by all subtests except
Matrix Reasoning, Information, Similarities, and Vocabulary, which formed multiple Mokken
scales (i.e., were multidimensional).

DMM Analysis
Tables 2 and 3 (V.IIO column) display the statistics for IIO applied to the subtests. The IIO
assumption was also visualized using the Mokken package. The results revealed no significant
IIO violations for any items of the Visual Puzzles, Figure Weights, Matrix Reasoning, Arithmetic,
Block Design, Digit Span, and Letter–Number Sequencing subtests, and the HT coefficient was
greater than .50, indicating a strong ordering.
However, six items (5, 7, 8, 10, 12, and 13) of the Picture Completion subtest violated the IIO
assumption, and Crit values were in the range 40 ≤ Crit < 80, although the backward item
selection method reported only two items (8 and 10) to be removed. Once the two items had
been removed from the Picture Completion subtest, HT was .81 (strong ordering) and the test
scalability H was .48, indicating a medium scale. The Information subtest showed significant
violations of IIO for seven items, but only one item (8) had a Crit value greater than 80
(approximately 90). Backward item selection confirmed that two items (5 and 8) should be
removed. Once these items had been removed, the HT coefficient for the remaining Information
items was .88, which is a very high value according to Ligtvoet et al. (2010), and H was .48,
indicating a medium scale.

Table 3. Fit Statistics of MSA Models for Wechsler Adult Intelligence Scale-IV Subtests.

Item | Visual Puzzles          | Figure Weights          | Picture Completion
     | M      Hj   V.IIO  Rank | M      Hj   V.IIO  Rank | M      Hj   V.IIO  Rank
2    | 0.993  .60         2    |                         |
3    | 0.993  .60         3    |                         |
4    | 0.964  .53         4    |                         |
5    | 0.995  .72         5    |                         | 0.593  .27  1      7
6    | 0.995  .72         6    | 0.880  .21         9    | 0.686  .33         5
7    | 0.943  .59         7    | 0.979  .58         6    | 0.60   .29  1      6
8    | 0.759  .47         8    | 0.838  .43         11   | 0.436  .30  2      11
9    | 0.604  .46         10   | 0.916  .63         7    | 0.507  .40         9
10   | 0.675  .53         9    | 0.901  .44         8    | 0.579  .50  2      8
11   | 0.557  .45         11   | 0.873  .54         10   | 0.279  .45         14
12   | 0.467  .47         12   | 0.768  .33         13   | 0.429  .49  1      12
13   | 0.458  .51         13   | 0.789  .47         12   | 0.471  .53  1      10
14   | 0.231  .31         17   | 0.549  .39         16   | 0.343  .58         13
15   | 0.316  .48         14   | 0.606  .53         14   | 0.071  .45         18
16   | 0.274  .50         15   | 0.444  .45         17   | 0.114  .53         16
17   | 0.259  .51         16   | 0.585  .52         15   | 0.136  .49         15
18   | 0.175  .53         18   | 0.444  .55         18   | 0.107  .63         17
19   | 0.151  .52         19   | 0.380  .62         19   | 0.036  .62         21
20   | 0.080  .48         20   | 0.197  .58         21   | 0.064  .66         19
21   | 0.080  .52         21   | 0.275  .59         20   | 0.029  .74         22
22   | 0.076  .50         22   | 0.148  .52         24   | 0.021  .59         23
23   | 0.038  .43         25   | 0.162  .56         23   | 0.050  .64         20
24   | 0.051  .51         23   | 0.190  .60         22   | 0.007  .26         24
25   | 0.042  .50         24   | 0.134  .61         25   |
26   | 0.005  .61         26   | 0.085  .54         26   |
27   |                         | 0.092  .65         27   |

Item | Matrix Reasoning        | Arithmetic              | Information
     | M      Hj   V.IIO  Rank | M      Hj   V.IIO  Rank | M      Hj   V.IIO  Rank
4    | 0.996  .67         4    |                         | 0.995  .86         4
5    | 0.986  .68         5    |                         | 0.103  .32  3      17
6    | 0.891  .01         8    | 0.995  .60         6    | 0.846  .22         8
7    | 0.941  .19         7    | 0.991  .79         7    | 0.897  .44         6
8    | 0.715  .23         10   | 0.972  .64         8    | 0.397  .29  2      13
9    | 0.946  .37         6    | 0.817  .46         10   | 0.608  .42         9
10   | 0.801  .39         9    | 0.864  .49         9    | 0.935  .57         5
11   | 0.669  .24         13   | 0.596  .47         12   | 0.061  .30         19
12   | 0.706  .28         11   | 0.676  .53         11   | 0.855  .34         7
13   | 0.688  .43         12   | 0.535  .50         13   | 0.509  .41         10
14   | 0.643  .35         14   | 0.441  .55         14   | 0.196  .47  1      15
15   | 0.539  .35         17   | 0.385  .57         15   | 0.299  .46         14
16   | 0.543  .42         15   | 0.277  .55         16   | 0.407  .54  1      11
17   | 0.543  .47         16   | 0.239  .58         19   | 0.402  .60  1      12
18   | 0.380  .43         19   | 0.249  .59         18   | 0.009  .34         24
19   | 0.394  .45         18   | 0.272  .64         17   | 0.149  .46  1      16
20   | 0.362  .44         20   | 0.085  .64         20   | 0.103  .47  1      18
21   | 0.303  .47         22   | 0.075  .62         21   | 0.051  .49         20
22   | 0.335  .49         21   | 0.047  .63         22   | 0.019  .29         21
23   | 0.299  .47         23   |                         | 0.014  .48         22
24   | 0.226  .50         24   |                         | 0.014  .48         23
25   | 0.122  .45         25   |                         | 0.009  .56         25
26   | 0.068  .41         26   |                         | 0.005  .65         26

Item | Vocabulary              | Similarities            | Letter–Number Sequencing
     | M      Hj   V.IIO  Rank | M      Hj   V.IIO  Rank | M      Hj   V.IIO  Rank
4    |                         |                         | 2.99   .14         4
5    |                         |                         | 2.77   .25         5
6    | 1.71   .08  2      10   | 1.94   .25         6    | 2.53   .26         6
7    | 1.80   .08  2      7    | 1.55   .11         7    | 1.53   .45         7
8    | 1.74   .19  2      9    | 1.30   .22         8    | 0.65   .56         8
9    | 1.09   .14  3      22   | 0.95   .22         10   | 0.18   .64         9
10   | 1.57   .15  2      12   | 1.16   .26         9    | 0.04   .59         10
11   | 1.78   .13  2      8    | 0.37   .31  1      14   |
12   | 1.84   .20  1      6    | 0.46   .30         12   |
13   | 1.25   .30         17   | 0.36   .24         15   |
14   | 0.70   .25         27   | 0.39   .39  1      13   |
15   | 1.24   .27         18   | 0.51   .44         11   |
16   | 1.48   .41         13   | 0.12   .38         16   |
17   | 0.54   .27         29   | 0.09   .45         18   |
18   | 1.20   .34         19   | 0.11   .36         17   |
19   | 1.15   .34  1      20   |                         |
20   | 1.40   .41         15   |                         |
21   | 1.60   .57  8      11   |                         |
22   | 0.98   .44         24   |                         |
23   | 1.14   .36  1      21   |                         |
24   | 0.90   .36         26   |                         |
25   | 1.47   .49  1      14   |                         |
26   | 1.40   .45         16   |                         |
27   | 1.09   .43  1      23   |                         |
28   | 0.64   .30         28   |                         |
29   | 0.21   .22         30   |                         |
30   | 0.98   .37         25   |                         |

Block Design
Item   M      Hj    Rank
9      3.40   .49   9
10     2.21   .58   10
11     0.80   .56   11
12     0.79   .63   12
13     0.43   .65   13
14     0.20   .64   14

Digit Span (Forward [F], Backward [B], Sequencing [S])
Item   M      Hj    Rank
F4     1.79   .40   4
F5     1.34   .44   5
F6     0.85   .40   6
F7     0.45   .41   7
F8     0.18   .42   8
B3     1.65   .26   3
B4     1.18   .41   4
B5     0.68   .44   5
B6     0.30   .45   6
B7     0.14   .46   7
B8     0.07   .41   8
S4     1.44   .37   4
S5     1.05   .37   5
S6     0.37   .35   6
S7     0.13   .42   7
S8     0.07   .44   8

Note. M = mean item score; Hj = item scalability; V.IIO = number of significant violations of the
invariant item ordering; Rank = item ordering using the mean score.

For the Similarities and Vocabulary subtests, the HT coefficient was .65 (strong ordering)
and .38 (weak ordering), and backward item selection confirmed that only one item (11) and
three items (9, 10, and 21) should be removed, respectively. As shown in Figure 1, Item 11
violated the IIO and nonintersection assumptions with Item 15 for the Similarities subtest;
Figure 1 also depicts the intersection between item-pairs 20 and 9, and 21 and 10 for the
Vocabulary subtest. For the Coding and Symbol Search subtests, although strong ordering was
achieved, the backward item selection procedure reported violations by some items, which should
be deleted.

Figure 1. Example violations of the IIO assumption for item-pair 15 and 11 for the Similarities subtest
and item-pairs 20 and 9, and 21 and 10 for the Vocabulary subtest using the manifest IIO method.
Note. IIO = invariant item ordering.

Table 3 (Rank column) displays the item ordering for each subtest using the mean score.
Using this approach, items with a lower mean score are reflected as more difficult. Interestingly,
the ordering was different for some items with respect to the original ranking described in the
WAIS-IV manual.
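
As a worked illustration of ordering by mean score, the Python sketch below ranks the Information items using the means transcribed from Table 3. The function name is hypothetical, and two conventions are assumptions inferred from the table rather than stated in the text: ranks start at the number of the first analyzed item (4 for Information, because the easier items were omitted), and tied means are ordered by item number.

```python
# Mean item scores for the Information subtest (items 4 to 26, Table 3).
information_means = {
    4: 0.995, 5: 0.103, 6: 0.846, 7: 0.897, 8: 0.397, 9: 0.608,
    10: 0.935, 11: 0.061, 12: 0.855, 13: 0.509, 14: 0.196, 15: 0.299,
    16: 0.407, 17: 0.402, 18: 0.009, 19: 0.149, 20: 0.103, 21: 0.051,
    22: 0.019, 23: 0.014, 24: 0.014, 25: 0.009, 26: 0.005,
}


def rank_by_mean(means, first_rank):
    """Rank items from easiest (highest mean) to most difficult (lowest
    mean); ties are broken by item number (an assumption, see above)."""
    ordered = sorted(means, key=lambda item: (-means[item], item))
    return {item: first_rank + position for position, item in enumerate(ordered)}


ranks = rank_by_mean(information_means, first_rank=4)
print(ranks[5], ranks[10])
```

Under these assumptions, Item 5 receives rank 17 and Item 10 receives rank 5, in line with the Rank column for Information in Table 3.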
It can, therefore, be concluded that the MHM and DMM fitted well to the subtest data,
although caution is required with certain items for which a poor fit was reported: In the MHM,
those items that failed to satisfy the Hj criterion, and in the DMM, those items that were removed
following backward item selection.

Discussion
This study uses two NIRT models to assess the psychometric properties of WAIS-IV subtests. In
reviewing the literature, no data were found on the application of MSA to WAIS-IV data. The
most interesting finding was that MHM fitted all items of the subtests, although a small number
of items fitted poorly as measured by the scalability coefficient. Sijtsma and Molenaar (2002)
noted that a fit of the MHM such as the one reported in this study suggests that the sum score
of each subtest is a good indicator of the latent trait. From a practical perspective, the sum score
of each subtest can be used to order adults on the latent trait. This is reinforced by the results
obtained with the item scalability coefficient Hi for the WAIS-IV subtests, which indicate that the
subtest items discriminate well between adults at different levels of the latent trait, such that
adults with a higher level of intelligence will score higher on each subtest. In any case, although
the results
also showed that care should be taken with those items with a scalability coefficient less than .30
when interpreting the total scores, it was not essential to remove these items as H was greater
than .4, indicating that they fitted well to the MHM.
Moreover, strong IIO (HT ≥ .50) was recorded for all WAIS-IV subtests with the exception
of Vocabulary, according to the criteria established by Ligtvoet et al. (2010). These findings
indicate that the WAIS-IV subtests present hierarchical information based on the difficulty of
each item. Items can therefore be administered in ascending order of difficulty, which reduces
administration time, and the discontinue rule can be applied if the individual fails to answer
several consecutive items; for example, the Matrix Reasoning subtest is discontinued if the
individual fails to correctly answer three consecutive items. This also makes it
possible to apply the starting rule according to the age of each individual. As expected, there
are differences in item order between the WAIS-IV subtests adapted to Arabic and those
detailed in the U.S. WAIS-IV manual, although the original subtest structures have been
maintained as far as possible. For verbal subtests, the order of item administration should be
changed. For instance, in the Information subtest, most of the items pertain to the Western
canons of geography, science, history, and literature. Specifically, Item 5, “Martin Luther
King,” is very easy for a U.S. sample, whereas individuals from Egypt may find it more dif-
ficult to answer correctly, so it was ranked 17th, whereas Item 10, “Cleopatra,” is very easy
for our sample and was ranked fifth. Similarly, the administration order of some items in the
nonverbal subtests should also be reexamined. For instance, Item 14 in Visual Puzzles was
ranked 17th in our study, and Item 17 in Figure Weights was ranked 15th. On the basis of
these results, applying the WAIS-IV subtests for Arabic speakers as ordered in the U.S.
WAIS-IV manual may have a negative impact on overall scores due to the presence of the
lowest ranked items and their implications for the application of the discontinue rule. In gen-
eral, it seems that WAIS-IV subtest items should be resequenced for Arabic speakers to obtain
more accurate scores. These results match those reported by Suwartono, Hidajat, Halim,
Hendriks, and Kessels (2016), who found that the orders established in the U.S. WAIS-IV
manual were unsuitable for Indonesia.
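The interaction between item order and the discontinue rule can be sketched as follows. The example assumes a rule of three consecutive failures, as described above for the matrix subtest; the response pattern and both item orders are hypothetical and serve only to show the mechanism.

```python
def score_with_discontinue(responses, order, max_fails=3):
    """Sum score when items are given in `order` and administration
    stops after `max_fails` consecutive incorrect answers (score 0)."""
    total = 0
    consecutive_fails = 0
    for item in order:
        if responses[item] > 0:
            total += responses[item]
            consecutive_fails = 0
        else:
            consecutive_fails += 1
            if consecutive_fails == max_fails:
                break  # remaining items are never administered
    return total

# Hypothetical respondent: items 2, 3, and 4 are hard for this person,
# although they appear early in the original administration order.
answers = [1, 1, 0, 0, 0, 1, 1, 1]
original_order = list(range(8))
resequenced_order = [0, 1, 5, 6, 7, 2, 3, 4]  # hard items moved to the end

score_original = score_with_discontinue(answers, original_order)
score_resequenced = score_with_discontinue(answers, resequenced_order)
```

In this sketch, the same respondent obtains a lower score under the original order because testing stops before the easier items are reached, which is the mechanism behind the concern about misplaced items.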
Moreover, according to Watson et al. (2012) and Ligtvoet et al. (2010), a lack of IIO arises
when many items measure the same level of the latent trait. Therefore, we can infer
from the IIO of WAIS-IV items that they measure different levels of the cognitive construct, and
this is confirmed by the variation in mean item scores; for instance, for the Visual Puzzles
subtest, mean scores ranged from .005 (Item 26; very difficult) to .995 (Items 1-5; very easy).
From this, and considering the IIO of the WAIS-IV, we can conclude that the ordering of subjects
based on the total score of each WAIS-IV subtest is invariant (Ligtvoet et al., 2010;
Sijtsma & Van der Ark, 2017).
Interestingly, the dimensionality results using the genetic algorithm indicated that the WAIS-IV
subtests analyzed in this study are unidimensional except for Matrix Reasoning, Information,
Similarities, and Vocabulary, which are multidimensional. The appearance of more than one
scale for some of the WAIS-IV subtests using Mokken analysis may explain the findings of previous
studies such as Abdelhamid et al. (2019), Bowden et al. (2011a), and Weiss, Keith, Zhu, and
Chen (2013a, 2013b), which suggested that some of these subtests loaded on more than one
factor. As such, the total score of each of the WAIS-IV subtests (except the multidimensional
subtests) can be computed to determine the adult’s level on the latent trait being measured. For
the multidimensional subtests (Matrix Reasoning, Information, Similarities, and Vocabulary), it is
necessary to calculate the total score for each dimension that reflects features of the latent trait
being measured.
Moreover, the reliability coefficients used in the current study (Molenaar–Sijtsma, lambda-2,
and latent class reliability) revealed high reliability for the subtests, which is an indication
of good quality. These findings are in line with those of previous studies such as Glass, Ryan,
and Charter (2010).
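Of the three coefficients named above, lambda-2 has a simple closed form based on the inter-item covariance matrix. The following is a minimal sketch of Guttman's lambda-2, with Cronbach's alpha included only for comparison; it assumes complete data and at least two items, and it does not reproduce the Molenaar–Sijtsma or latent class estimators. Function names and data are hypothetical.

```python
from math import sqrt

def _covariances(scores):
    """Sample covariance matrix of the items (respondents in rows)."""
    n = len(scores)
    k = len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in scores) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]

def cronbach_alpha(scores):
    cov = _covariances(scores)
    k = len(cov)
    total_var = sum(map(sum, cov))  # variance of the sum score
    off_diag = total_var - sum(cov[i][i] for i in range(k))
    return k / (k - 1) * off_diag / total_var

def guttman_lambda2(scores):
    cov = _covariances(scores)
    k = len(cov)
    total_var = sum(map(sum, cov))
    off = [cov[i][j] for i in range(k) for j in range(k) if i != j]
    # Sum of off-diagonal covariances plus a term based on their squares.
    return (sum(off) + sqrt(k / (k - 1) * sum(c * c for c in off))) / total_var

# Hypothetical data: 5 respondents x 3 dichotomous items.
sample = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
alpha_value = cronbach_alpha(sample)
lambda2_value = guttman_lambda2(sample)
```

Lambda-2 is never smaller than alpha for the same data, which is one reason it is often reported alongside or instead of alpha.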
From an empirical perspective, this study provides new understanding of how to apply
Mokken analysis to intelligence scales and how to assess the fit of NIRT models. Our analysis
has shown that the MHM and DMM fit the WAIS-IV, giving evidence of their highly successful
application in intelligence scales. The current findings should be extrapolated only to the 18 to
24 years age group. Although the sample is expected to be representative of the 18 to 24 years
age group and the data satisfied the normality assumption, care should be taken when drawing
inferences with regard to other age groups and regions. It is unfortunate that our study did not
include the Comprehension and Cancellation subtests. Symbol Search and Coding are speeded
subtests, so the results should be approached with caution; as such, we did not discuss these
results at the item level.
In conclusion, the present study provides several interesting findings on the dimensionality
and hierarchy of WAIS-IV subtests in an MSA framework. In future research, it may be of interest
to use different WAIS-IV data (samples from other countries, or an extension of the current sample
to include other sociodemographic characteristics) to compare NIRT models and establish their
statistical properties. Moreover, the use of a variety of IRT models may yield useful information
from which to draw conclusions about item fit or item weakness. The current findings offer many
suggestions that may improve the WAIS-IV subtests adapted for Arabic speakers. First, consid-
eration should be given to reordering the items of some subtests to obtain more accurate mean
score estimates using modern theoretical approaches such as Mokken analysis. Second, some
WAIS-IV items that did not fit well to the NIRT models could be revised, and some could be
omitted in the construction of a shortened version of the WAIS-IV, which was suggested by pre-
vious studies such as Denney, Ringe, and Lacritz (2015) and Meyers, Zellinger, Kockler, Wagner,
and Miller (2013).

Declaration of Conflicting Interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or
publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publi-
cation of this article: This work was supported by the Egyptian Ministry of Higher Education, Management
of Supporting Excellence, Competitive Excellence Project of Higher Education Institutions (grant 2016),
and by the Agency for the Management of University and Research Grants of the Government of Catalonia
(grant 2017SGR1681). The funders played no role in the study design, data collection and analysis, decision
to publish, or preparation of the article.

ORCID iDs
Gomaa S. M. Abdelhamid https://orcid.org/0000-0002-9107-9388
Juana Gómez-Benito https://orcid.org/0000-0002-4280-3106

References
Abdelhamid, G. S. M., Gómez-Benito, J., Abdeltawwab, A. T. M., Abu Bakr, M. H. S., & Kazem, A. M.
(2019). Hierarchical structure of the Wechsler Adult Intelligence Scale–Fourth Edition with an Egyptian
Sample. Journal of Psychoeducational Assessment, 37, 395-404. doi:10.1177/0734282917732857
Benson, N., Hulac, D. M., & Kranzler, J. H. (2010). Independent examination of the Wechsler Adult
Intelligence Scale–Fourth Edition (WAIS-IV): What does the WAIS-IV measure? Psychological
Assessment, 22, 121-130. doi:10.1037/a0017767
Bowden, S. C., Saklofske, D. H., & Weiss, L. G. (2011a). Augmenting the core battery with supplementary
subtests: Wechsler Adult Intelligence Scale–IV measurement invariance across the United States and
Canada. Assessment, 18, 133-140. doi:10.1177/1073191110381717
Bowden, S. C., Saklofske, D. H., & Weiss, L. G. (2011b). Invariance of the measurement model under-
lying the Wechsler Adult Intelligence Scale–IV in the United States and Canada. Educational and
Psychological Measurement, 71, 186-199. doi:10.1177/0013164410387382
Climie, E. A., & Rostad, K. (2011). Test review: Wechsler Adult Intelligence Scale. Journal of
Psychoeducational Assessment, 29, 581-586. doi:10.1177/0734282911408707
Denney, D. A., Ringe, W. K., & Lacritz, L. H. (2015). Dyadic short forms of the Wechsler Adult Intelligence
Scale–IV. Archives of Clinical Neuropsychology, 30, 404-412. doi:10.1093/arclin/acv035
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341-349.
doi:10.1037/1040-3590.8.4.341
Emons, W. H. M., Sijtsma, K., & Pedersen, S. S. (2012). Dimensionality of the Hospital Anxiety and
Depression Scale (HADS) in cardiac patients: Comparison of Mokken scale analysis and factor analy-
sis. Assessment, 19, 337-353. doi:10.1177/1073191110384951
Glass, L. A., Ryan, J. J., & Charter, R. A. (2010). Discrepancy score reliabilities in the WAIS-IV standardiza-
tion sample. Journal of Psychoeducational Assessment, 28, 201-208. doi:10.1177/0734282909346710
Kaufman, A. S., Salthouse, T. A., Scheiber, C., & Chen, H. (2016). Age differences and educational attain-
ment across the life span on three generations of Wechsler Adult Scales. Journal of Psychoeducational
Assessment, 34, 421-441. doi:10.1177/0734282915619091
Ligtvoet, R., Van der Ark, L. A., te Marvelde, J. M., & Sijtsma, K. (2010). Investigating an invariant item
ordering for polytomously scored items. Educational and Psychological Measurement, 70, 578-595.
doi:10.1177/0013164409355697
Meyers, J. E., Zellinger, M. M., Kockler, T., Wagner, M., & Miller, R. M. (2013). A validated seven-subtest
short form for the WAIS-IV. Applied Neuropsychology-Adult, 20, 249-256.
doi:10.1080/09084282.2012.710180
Miller, D. I., Davidson, P. S. R., Schindler, D., & Messier, C. (2013). Confirmatory factor analysis of
the WAIS-IV and WMS-IV in older adults. Journal of Psychoeducational Assessment, 31, 375-390.
doi:10.1177/0734282912467961
Mokken, R. J. (1971). A theory and procedure of scale analysis: With applications in political research.
The Hague, The Netherlands: De Gruyter.
Mokken, R. J. (1997). Nonparametric models for dichotomous responses. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of modern item response theory (pp. 351-368). New York, NY: Springer.
Molenaar, I. W. (1997). Nonparametric models for polytomous responses. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of modern item response theory (pp. 369-380). New York, NY: Springer.
Mooij, T. (2012). A Mokken Scale to assess secondary pupils’ experience of violence in terms of severity.
Journal of Psychoeducational Assessment, 30, 496-508. doi:10.1177/0734282912439387
Nelson, J. M., Canivez, G. L., & Watkins, M. W. (2013). Structural and incremental validity of the Wechsler
Adult Intelligence Scale–Fourth Edition with a clinical sample. Psychological Assessment, 25, 618-
630. doi:10.1037/a0032086
Saklofske, D. H., Zhu, J., Miller, J. L., Weiss, L. G., Babcock, S. E., Cayton, T. G., & Coalson, D. L.
(2012). The Cognitive Proficiency Index for the Canadian Edition of the Wechsler Adult Intelligence
Scale–Fourth Edition. Canadian Journal of Behavioural Science/Revue Canadienne des Sciences du
Comportement, 44, 117-123. doi:10.1037/a0026734
Salthouse, T. A., & Saklofske, D. H. (2010). Do the WAIS-IV Tests measure the same aspects of cogni-
tive functioning in adults under and over 65? In L. G. Weiss, D. L. Coalson, D. H. Saklofske, & S. E.
Raiford (Eds.), WAIS-IV clinical use and interpretation: Scientist-practitioner perspectives (pp. 217-
235). San Diego, CA: Academic Press.
Sijtsma, K. (2009). Correcting fallacies in validity, reliability, and classification. International Journal of
Testing, 9, 167-194. doi:10.1080/15305050903106883
Sijtsma, K., Debets, P., & Molenaar, I. W. (1990). Mokken scale analysis for polychotomous items: Theory,
a computer program and an empirical application. Quality and Quantity, 24, 173-188.
doi:10.1007/BF00209550
Sijtsma, K., Emons, W. H. M., Bouwmeester, S., Nyklíček, I., & Roorda, L. D. (2008). Nonparametric IRT
analysis of Quality-of-Life Scales and its application to the World Health Organization Quality-of-Life
Scale (WHOQOL-Bref). Quality of Life Research, 17, 275-290. doi:10.1007/s11136-007-9281-6
Sijtsma, K., Meijer, R. R., & Van der Ark, L. A. (2011). Mokken scale analysis as time goes by: An
update for scaling practitioners. Personality and Individual Differences, 50, 31-37.
doi:10.1016/j.paid.2010.08.016
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory (1st ed.).
Thousand Oaks, CA: SAGE.
Sijtsma, K., & Van der Ark, L. A. (2017). A tutorial on how to do a Mokken scale analysis on your
test and questionnaire data. British Journal of Mathematical and Statistical Psychology, 70, 137-158.
doi:10.1111/bmsp.12078
Straat, J. H., Van der Ark, L. A., & Sijtsma, K. (2013). Comparing optimization algorithms for item selec-
tion in Mokken scale analysis. Journal of Classification, 30, 75-99. doi:10.1007/s00357-013-9122-y
Suwartono, C., Hidajat, L. L., Halim, M. S., Hendriks, M. P. H., & Kessels, R. P. C. (2016). External
validity of the Indonesian Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV-ID). ANIMA
Indonesian Psychological Journal, 32(1), Article 16. doi:10.24123/aipj.v32i1.581
Van der Ark, L. A. (2016). R package Mokken V 2.8.2. Retrieved from
https://cran.r-project.org/web/packages/mokken/
Van der Ark, L. A., Van der Palm, D. W., & Sijtsma, K. (2011). A latent class approach to estimating test-
score reliability. Applied Psychological Measurement, 35, 380-392. doi:10.1177/0146621610392911
Van Schuur, W. H. (2011). Ordinal item response theory: Mokken scale analysis. Thousand Oaks, CA:
SAGE.
Watson, R., van der Ark, L. A., Lin, L.-C., Fieo, R., Deary, I. J., & Meijer, R. R. (2012). Item response
theory: How Mokken scaling can be used in clinical practice. Journal of Clinical Nursing, 21, 2736-
2746. doi:10.1111/j.1365-2702.2011.03893.x
Wechsler, D. (2008). WAIS-IV administration and scoring manual. San Antonio, TX: Psychological
Corporation.
Weiss, L. G., Keith, T. Z., Zhu, J., & Chen, H. (2013a). Technical and practical issues in the structure and
clinical invariance of the Wechsler Scales: A rejoinder to commentaries. Journal of Psychoeducational
Assessment, 31, 235-243. doi:10.1177/0734282913478050
Weiss, L. G., Keith, T. Z., Zhu, J., & Chen, H. (2013b). WAIS-IV and clinical validation of the four-
and five-factor interpretative approaches. Journal of Psychoeducational Assessment, 31, 94-113.
doi:10.1177/0734282913478030
Weiss, L. G., Saklofske, D. H., Coalson, D. L., & Raiford, S. E. (2010). Theoretical, empirical and clinical
foundations of the WAIS-IV Index Scores. In L. G. Weiss, D. L. Coalson, D. H. Saklofske, & S. E.
Raiford (Eds.), WAIS-IV clinical use and interpretation: Scientist-practitioner perspectives (pp. 64-
94). San Diego, CA: Academic Press.
