To cite this article: Joseph A. Rios, Hongwen Guo, Liyang Mao & Ou Lydia Liu (2016): Evaluating
the Impact of Careless Responding on Aggregated-Scores: To Filter Unmotivated Examinees or
Not?, International Journal of Testing, DOI: 10.1080/15305058.2016.1231193
International Journal of Testing, 0: 1–31, 2016
Copyright © Educational Testing Service
ISSN: 1530-5058 print / 1532-7574 online
DOI: 10.1080/15305058.2016.1231193
Correspondence should be sent to Joseph Rios, Educational Testing Service, 660 Rosedale Road,
Princeton, NJ 08540. E-mail: jrios@ets.org
Liyang Mao is now affiliated with IXL Learning.
results that do not accurately reflect examinees’ ability may provide mis-
leading information for important stakeholders such as policy makers and
institutional administrators (American Educational Research Association
[AERA], 2000). As a result, it is important to examine whether examinees
are sufficiently serious when taking the test (AERA, American Psychologi-
cal Association, & National Council on Measurement in Education, 2014).
To this end, the focus of this article is on random careless responding,1
which we define as nonsystematic responding with intentional disregard for
item content due to low test-taking motivation. The following sections pro-
vide a summary of previous research that has evaluated the impact of ran-
dom careless responding on test results as well as methods for both defining
and filtering random careless responding via response latencies (Wise &
Kong, 2005).
Study Objectives
As previous research has been limited in providing practical recommendations to
practitioners on when and how random careless responding should be dealt with,
the objectives of this study were to evaluate (a) the degree of random careless
responding that would have a practically significant impact on aggregated
scores, (b) the effectiveness of purifying biased aggregated scores for two filter-
ing procedures, examinee- and response-level filtering, and (c) the tenability of
the assumption that random careless responding is unrelated to ability, which
underlies the common practice of listwise deleting examinees deemed to be care-
less responders or unmotivated (i.e., examinee-level filtering). These study
objectives were addressed via two studies. In study 1 we investigated the follow-
ing research questions through data simulations:
Hypothesis: Underestimation is most impacted by two factors: (a) the total percent-
age of careless responses in the sample and (b) test difficulty. Specifically, it is
hypothesized that as the percentage of careless responses in the sample increases,
underestimation will also increase, but at a differential rate based on test difficulty.
That is, as the true probability of success for many examinees is high when a test is
easy, careless responding will lead to increased underestimation, but as the true
probability of success decreases (as is the case for more difficult tests) so does the
impact of careless responding on aggregated scores.
Hypothesis: Both procedures will perform equally well when careless responding is
unrelated to ability; however, it is expected that examinee-level filtering will lead
to less accurate filtering when careless responding is related to ability. The reason
for the latter hypothesis is that when ability is related to careless responding, list-
wise deleting data from examinees deemed to be unmotivated (examinee-level fil-
tering) will alter the true proficiency distribution and result in an inaccurate
observed group score (Wise, Kingsbury, Thomason, & Kong, 2004). Response-
level filtering will be less susceptible to this issue as noncareless response data for
all examinees are included to compute expected scores. As long as the noncareless
responses serve as valid indicators of examinees’ true ability, the true proficiency
distribution should be minimally altered, and as a result, should lead to an accurate
observed group score.
Results from these studies have the potential to inform practitioners of the condi-
tions in which random careless responding is a major concern for group-level
score use and which approach is valid and effective for purifying biased scores.
STUDY 1
Method
To investigate the objectives of study 1, simulated data were analyzed. Next is a
description of how the data were generated, the independent variables that were
manipulated, and the analysis procedures that were implemented.
$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\{-1.7\,a_i(\theta - b_i)\}} \qquad (1)$$
where $P_i(\theta)$ indicates the probability of getting item $i$ correct; $a_i$ is the item discrimination parameter for item $i$; $b_i$ is the item difficulty parameter for item $i$; and $c_i$ is the pseudo-guessing parameter for item $i$. Simulee ability parameters ($\theta$) were randomly sampled from $N(0, 1)$, while the generating item parameters were sampled from fixed distributions in which $b_i$ varied across conditions. The probability for simulee $j$ on item $i$ was then
compared to a random number taken from runif (0, 1). A correct response was
given if the probability was larger than the random number, otherwise an incor-
rect response was given.
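To make this generation step concrete, the following sketch draws simulee abilities from N(0, 1), computes each simulee-by-item probability from Equation (1), and scores a response as correct when that probability exceeds a uniform random draw. The test length and the item-parameter values used here are illustrative assumptions, not the study's generating distributions.

```python
import numpy as np

rng = np.random.default_rng(42)

n_simulees, n_items = 500, 30   # 500 simulees per condition; test length assumed

theta = rng.normal(0.0, 1.0, n_simulees)   # ability ~ N(0, 1), as in the study
a = rng.lognormal(0.0, 0.25, n_items)      # illustrative discrimination values
b = rng.normal(-1.0, 1.0, n_items)         # illustrative easy-test difficulties
c = np.full(n_items, 0.25)                 # pseudo-guessing fixed at chance

# Equation (1): 3PL probability of a correct response, simulee x item
probs = c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))

# A correct response is given when the probability exceeds a uniform draw
responses = (probs > rng.uniform(size=(n_simulees, n_items))).astype(int)
```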
Data generation for unmotivated simulees. The data for unmotivated
simulees were generated very similarly to the approach described; however, an
additional step was added. Specifically, after generating item responses based on
the 3PL model, various amounts of these responses were randomly replaced by
random careless responses for unmotivated simulees (refer to Wise & DeMars,
2006, for a similar approach). Within this study, random careless responses were
conceptualized as possessing a correct item response probability equal to
chance, $P_i(\theta) = 0.25$ (assuming four response options for each generated item).
The decision to implement random careless responses with $P_i(\theta) = 0.25$
instead of generating these responses based on the assumptions of the absorbing
state (e.g., Jin & Wang, 2014), decreasing effort (e.g., Goegebeur, De Boeck,
Wollack, & Cohen, 2008) or difficulty-based (e.g., Cao & Stokes, 2008) models
was based on the findings of Wise and Kingsbury (2016). In their study, the fit
of these three models was compared to the effort-moderated model (Wise &
DeMars, 2006) for data obtained from a low-stakes formative assessment admin-
istered to 285,230 examinees.
Three main conclusions can be drawn from the Wise and Kingsbury (2016)
study. First, examinee behavior does not correspond to the assumption that ran-
dom careless responding only occurs when an examinee receives a condition-
ally difficult item (assumed by the difficulty-based model). This finding was
supported by previous research from Wise (2006), who found that item diffi-
culty was nonsignificantly related to examinee effort. Second, patterns from
the data did not demonstrate that random careless responding occurs: (a) gradu-
ally across the test (assumed by the decreasing effort model) or (b) due to a
switch from a motivated to unmotivated state (assumed by the absorbing state
model). Third, assuming no clear pattern of random careless responding may
be most advantageous as such behavior appears to occur idiosyncratically
(assumed by the effort-moderated model). As a result, random careless
responses were generated with a chance probability. Once the replacement pro-
cess was completed, data for the motivated and unmotivated simulees were
combined for analysis.
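Continuing the sketch above, the replacement step for unmotivated simulees might look as follows; the condition shown (25% unmotivated simulees, each careless on 50% of items) is one cell of the study's design.

```python
import numpy as np  # continues the sketch above

def inject_careless(responses, unmot_idx, prop_careless, rng):
    """Overwrite a random subset of each unmotivated simulee's responses
    with chance-level (P = .25) random careless responses."""
    out = responses.copy()
    n_items = out.shape[1]
    n_replace = int(round(prop_careless * n_items))
    mask = np.zeros(out.shape, dtype=bool)
    for j in unmot_idx:
        items = rng.choice(n_items, size=n_replace, replace=False)
        out[j, items] = (rng.uniform(size=n_replace) < 0.25).astype(int)
        mask[j, items] = True
    return out, mask

# Example condition: 25% unmotivated simulees, each careless on 50% of items
unmot_idx = rng.choice(n_simulees, size=n_simulees // 4, replace=False)
observed, careless_mask = inject_careless(responses, unmot_idx, 0.50, rng)
```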
These four independent variables and their respective levels were fully crossed,
which resulted in a 3 × 3 × 2 × 3 design for a total of 54 conditions. Across con-
ditions, the number of simulees was constrained to 500, and to minimize sam-
pling error, each condition was replicated 100 times. As a result, the total
number of simulees within each condition was 50,000.
Analyses. The analytic procedures for the two research questions in this
study differed slightly and as a result, are described separately below.
Impact of random careless responding on aggregated scores. The analy-
ses for this part of the study were quite simple. To evaluate the degree of bias
introduced by random careless responding into aggregated scores, we compared
the observed average total score (including random careless responses) with the
true mean, which was based on the generated data for all simulees without random
careless responses. Comparisons were made by conducting independent-samples
t-tests and calculating Cohen's d effect sizes. Based on Cohen's (1988)
recommendations, a d greater than 0.20 was classified as a nonnegligible difference.
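Assuming the simulated matrices from the sketches above, this comparison reduces to a few lines:

```python
import numpy as np
from scipy import stats  # continues the sketches above

def cohens_d(x, y):
    """Standardized mean difference using the pooled SD (Cohen, 1988)."""
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                      (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled

true_totals = responses.sum(axis=1)   # generated data, no careless responses
obs_totals = observed.sum(axis=1)     # with injected careless responses

t, p = stats.ttest_ind(obs_totals, true_totals)
d = cohens_d(obs_totals, true_totals)  # observed minus true, as in Table 1
# |d| > 0.20 was classified as a nonnegligible difference
```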
Comparability of filtering procedures in purifying biased aggregated
scores. To compare the recovery of the true mean by examinee- and
response-level filtering, we first filtered examinees and then compared
observed (based on filtering) and true means as described. Examinee-level fil-
tering was implemented by listwise deleting data for any unmotivated simu-
lees. The employment of response-level filtering was a bit more involved as
IRT expected scores were computed based on a three-stage process. First,
random careless responses were recoded as blank. Such a decision was based
on the idea that a careless response is not a valid indicator of an examinee’s
ability and as such, any response option provided should not be scored as is
nor should the response automatically be scored as incorrect. Second, item
and person parameters for the recoded data were estimated based on the
modified-3PL model using marginal maximum likelihood estimation in the R ltm
package (Rizopoulos, 2006). The modified-3PL model differed from
the generating 3PL model in that the c-parameter was constrained to 0.25
across all items to obtain stable parameter estimates, as recommended
by Han (2012). As a result, realistic error was built
into the estimates by using a generating 3PL model and estimating data with
the modified-3PL model. The third step was to place the estimated person
and item parameters into the 3PL model to obtain an expected group mean
across all items as follows:

$$\sum_{j=1}^{J} \sum_{i=1}^{I} P(x_i = 1 \mid \hat{\theta}_j,\ \hat{a}_i,\ \hat{b}_i,\ c_i) = \sum_{j=1}^{J} \sum_{i=1}^{I} \left[ c_i + \frac{1 - c_i}{1 + \exp[-\hat{a}_i(\hat{\theta}_j - \hat{b}_i)]} \right] \qquad (3)$$
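A minimal sketch of this third step, assuming the person and item parameters have already been estimated from the recoded data (in the study, via marginal maximum likelihood in the R ltm package):

```python
import numpy as np

def expected_group_mean(theta_hat, a_hat, b_hat, c=0.25):
    """Evaluate Equation (3): sum each simulee's 3PL correct-response
    probabilities across items (c fixed at .25, matching the modified-3PL
    model), then average the expected total scores over simulees."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a_hat * (theta_hat[:, None] - b_hat)))
    return p.sum(axis=1).mean()
```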
Results
TABLE 1
Impact of Random Careless Responding on Aggregated Scores

Columns, in order: test difficulty; percentage of unmotivated examinees; percentage of within-examinee random careless responses; percentage of total random careless responses in the sample; "true" mean and SD; observed mean, SD, and d when random careless responding is unrelated to ability; and observed mean, SD, and d when it is related to ability.
Easy 10% 10% 1.00% 21.82 1.06 21.67 1.06 -0.04 21.58 1.33 -0.03
(b = -1.0) 10% 25% 2.50% 21.86 1.17 21.53 1.16 -0.09 21.56 1.10 -0.08
10% 50% 5.00% 21.62 1.28 20.91 1.21 -0.18* 20.97 1.02 -0.16*
25% 10% 2.50% 21.87 1.17 21.53 1.14 -0.10 21.50 1.11 -0.09
25% 25% 6.25% 21.95 1.18 21.10 1.11 -0.24* 21.22 1.12 -0.20*
25% 50% 12.50% 21.85 1.11 20.07 0.98 -0.43* 20.09 1.07 -0.37*
50% 10% 5.00% 21.75 1.10 21.04 1.04 -0.20* 21.05 1.00 -0.17*
50% 25% 12.50% 21.95 1.28 20.27 1.15 -0.46* 20.37 0.95 -0.39*
50% 50% 25.00% 21.76 1.08 18.17 0.82 -0.85* 18.56 0.85 -0.69*
Moderate 10% 10% 1.00% 18.76 1.20 18.66 1.18 -0.03 18.44 1.13 -0.03
Difficulty 10% 25% 2.50% 18.90 1.10 18.63 1.07 -0.07 18.66 1.10 -0.06
(b = 0) 10% 50% 5.00% 18.83 1.10 18.26 1.04 -0.14* 18.27 1.21 -0.12*
25% 10% 2.50% 18.78 1.12 18.50 1.10 -0.07 18.36 1.20 -0.06
25% 25% 6.25% 18.55 1.19 17.90 1.14 -0.17* 18.05 1.19 -0.14*
25% 50% 12.50% 18.85 1.16 17.43 1.02 -0.34* 17.55 0.97 -0.28*
50% 10% 5.00% 18.74 1.11 18.19 1.07 -0.14* 18.08 1.14 -0.12*
50% 25% 12.50% 18.75 1.27 17.44 1.12 -0.34* 17.77 1.01 -0.29*
50% 50% 25.00% 18.72 1.20 15.92 0.91 -0.68* 16.45 0.84 -0.55*
Difficult 10% 10% 1.00% 15.71 1.12 15.63 1.11 -0.02 15.62 1.19 -0.02
(b = 1.0) 10% 25% 2.50% 15.84 1.16 15.65 1.14 -0.05 15.37 1.15 -0.04
10% 50% 5.00% 15.73 1.20 15.31 1.14 -0.11 15.55 1.08 -0.09
25% 10% 2.50% 15.64 1.07 15.45 1.06 -0.05 15.45 1.03 -0.05
25% 25% 6.25% 15.53 1.07 15.06 1.00 -0.13 15.36 1.05 -0.11
25% 50% 12.50% 15.82 1.10 14.78 0.98 -0.27* 15.05 0.96 -0.22*
50% 10% 5.00% 15.83 1.25 15.40 1.19 -0.12 15.46 1.06 -0.09
50% 25% 12.50% 15.65 1.16 14.70 1.03 -0.26* 14.94 1.00 -0.21*
50% 50% 25.00% 15.62 1.03 13.59 0.77 -0.52* 14.09 0.89 -0.41*
Note. Effect sizes (d) were computed by subtracting the true mean from the observed mean (with random careless responses), so negative values indicate underestimation. * p < .05
FIGURE 1
The impact of random careless responding on aggregated scores by percentage of random careless responses in the total sample, test difficulty, and relationship between ability and random careless responding. Note that values greater than |0.20| were defined as of practical concern in this study (as represented by the horizontal dashed line). Furthermore, Easy Unrelated = easy test difficulty, ability unrelated to random careless responding; Easy Related = easy test difficulty, ability related to random careless responding; Moderate Unrelated = moderate test difficulty, ability unrelated to random careless responding; Moderate Related = moderate test difficulty, ability related to random careless responding; Hard Unrelated = difficult test, ability unrelated to random careless responding; Hard Related = difficult test, ability related to random careless responding.
not related to ability; however, the small differences were exacerbated for the
conditions in which the total amount of random careless responding in the sam-
ple was equal to 25%. For example, when random careless responding was unre-
lated to ability for this condition, the degree of underestimation was as much as
0.15 SDs greater than when it was related to ability. This was most likely due to
a greater random sampling of high ability simulees for the unrelated condition.
TABLE 2
Comparison of Filtering Approaches on Purifying Aggregated Scores When Random Careless Responding Is Unrelated to Ability

Columns, in order: test difficulty; percentage of unmotivated examinees; percentage of within-examinee random careless responses; percentage of total random careless responses in the sample; "true" mean and SD; observed mean, SD, and d under examinee-level filtering; and observed mean, SD, and d under response-level filtering.

Easy 10% 10% 1.00% 21.82 1.06 21.82 1.07 0.00 21.81 1.04 0.00
(b = -1.0) 10% 25% 2.50% 21.86 1.17 21.86 1.18 0.00 21.86 1.16 0.00
10% 50% 5.00% 21.62 1.28 21.63 1.27 0.00 21.62 1.25 0.00
25% 10% 2.50% 21.87 1.17 21.87 1.18 0.00 21.90 1.16 0.01
25% 25% 6.25% 21.95 1.18 21.93 1.17 -0.01 21.95 1.16 0.00
25% 50% 12.50% 21.85 1.11 21.86 1.10 0.00 21.89 1.11 0.01
50% 10% 5.00% 21.75 1.10 21.79 1.09 0.01 21.80 1.11 0.01
50% 25% 12.50% 21.95 1.28 21.96 1.28 0.00 22.00 1.30 0.02
50% 50% 25.00% 21.76 1.08 21.71 1.11 -0.01 21.79 1.11 0.01
Moderate 10% 10% 1.00% 18.76 1.20 18.76 1.21 0.00 18.76 1.22 0.00
Difficulty 10% 25% 2.50% 18.90 1.10 18.92 1.10 0.00 18.90 1.12 0.00
(b = 0) 10% 50% 5.00% 18.83 1.10 18.83 1.10 0.00 18.82 1.13 0.00
25% 10% 2.50% 18.78 1.12 18.78 1.13 0.00 18.78 1.14 0.00
25% 25% 6.25% 18.55 1.19 18.55 1.20 0.00 18.54 1.23 0.00
25% 50% 12.50% 18.85 1.16 18.85 1.16 0.00 18.86 1.19 0.00
50% 10% 5.00% 18.74 1.11 18.74 1.11 0.00 18.76 1.15 0.00
50% 25% 12.50% 18.75 1.27 18.77 1.29 0.00 18.75 1.29 0.00
50% 50% 25.00% 18.72 1.20 18.72 1.24 0.00 18.71 1.24 0.00
Difficult 10% 10% 1.00% 15.71 1.12 15.71 1.13 0.00 15.65 1.16 -0.02
(b = 1.0) 10% 25% 2.50% 15.84 1.16 15.83 1.15 0.00 15.78 1.21 -0.02
10% 50% 5.00% 15.73 1.20 15.74 1.18 0.00 15.65 1.25 -0.02
25% 10% 2.50% 15.64 1.07 15.65 1.08 0.00 15.58 1.12 -0.02
25% 25% 6.25% 15.53 1.07 15.53 1.07 0.00 15.45 1.11 -0.02
25% 50% 12.50% 15.82 1.10 15.80 1.10 0.00 15.74 1.15 -0.03
50% 10% 5.00% 15.83 1.25 15.82 1.25 0.00 15.75 1.30 -0.03
50% 25% 12.50% 15.65 1.16 15.68 1.18 0.01 15.55 1.23 -0.03
50% 50% 25.00% 15.62 1.03 15.61 1.08 0.00 15.50 1.09 -0.04
Note. Effect sizes (d) were computed by subtracting the "true" mean from the filtered mean. As a result, a positive effect size indicates that the filtering procedure overestimated the "true" mean, that is, artificially inflated the mean after removing unmotivated examinees (examinee-level filtering) or flagged responses (response-level filtering).
* p < .05
FIGURE 2
Bias on true mean scores when employing examinee-level filtering by percentage of
unmotivated simulees, test difficulty, and relationship between random careless respond-
ing and ability.
the difference ranged from 0.01 to 0.02 SDs, while ranging from 0.02 to 0.04 for
a difficult test; however, these filtered means were not statistically or practically
different from the true mean (Figure 3). As a result, when unmotivated simulees
were sampled from across the ability distribution, both examinee- and response-
level filtering were found to perform equally well.
Table 3 presents the comparative results between filtering procedures when
unmotivated simulees were assumed to be of low ability. In contrast to the
previous findings, examinee-level filtering predominately produced higher
rates of inflation than response-level filtering (Figures 2 and 3). Specifically,
as the percentage of unmotivated simulees increased, so did the inflation of
the true mean.
FIGURE 3
Bias on true mean scores when employing response-level filtering by percentage of random
careless responses in the total sample, test difficulty, and relationship between random care-
less responding and ability.
TABLE 3
Comparison of Filtering Approaches on Purifying Aggregated Scores When Random Careless Responding Is Related to Ability

Columns, in order: test difficulty; percentage of unmotivated examinees; percentage of within-examinee random careless responses; percentage of total random careless responses in the sample; "true" mean and SD; observed mean, SD, and d under examinee-level filtering; and observed mean, SD, and d under response-level filtering.
Easy 10% 10% 1.00% 21.82 1.06 21.85 1.34 0.04 21.69 1.32 -0.00
(b = -1.0) 10% 25% 2.50% 21.86 1.17 22.03 1.13 0.04 21.87 1.11 0.00
10% 50% 5.00% 21.62 1.28 21.76 1.09 0.04 21.64 1.06 0.01
25% 10% 2.50% 21.87 1.17 22.30 1.14 0.13* 21.84 1.12 -0.00
25% 25% 6.25% 21.95 1.18 22.45 1.20 0.13* 22.00 1.16 0.01
25% 50% 12.50% 21.85 1.11 22.18 1.19 0.13* 21.78 1.19 0.03
50% 10% 5.00% 21.75 1.10 23.08 1.04 0.39* 21.71 1.05 0.01
50% 25% 12.50% 21.95 1.28 23.34 1.04 0.40* 21.97 1.07 0.02
50% 50% 25.00% 21.76 1.08 23.17 1.10 0.39* 21.98 1.13 0.07
Moderate 10% 10% 1.00% 18.76 1.20 18.70 1.14 0.04 18.53 1.17 -0.00
Difficulty 10% 25% 2.50% 18.90 1.10 19.07 1.11 0.04 18.90 1.14 0.00
(b = 0) 10% 50% 5.00% 18.83 1.10 18.92 1.26 0.04 18.79 1.28 -0.01
25% 10% 2.50% 18.78 1.12 19.09 1.25 0.13* 18.61 1.26 -0.00
25% 25% 6.25% 18.55 1.19 19.12 1.29 0.13* 18.64 1.29 0.01
25% 50% 12.50% 18.85 1.16 19.27 1.13 0.13* 18.86 1.13 0.03
50% 10% 5.00% 18.74 1.11 20.06 1.22 0.38* 18.56 1.22 -0.00
50% 25% 12.50% 18.75 1.27 20.44 1.19 0.39* 19.01 1.18 0.02
50% 50% 25.00% 18.72 1.20 20.40 1.14 0.38* 19.16 1.16 0.07
Difficult 10% 10% 1.00% 15.71 1.12 15.85 1.19 0.04 15.62 1.24 -0.02
(b = 1.0) 10% 25% 2.50% 15.84 1.16 15.66 1.19 0.04 15.46 1.22 -0.02
10% 50% 5.00% 15.73 1.20 16.07 1.14 0.04 15.88 1.17 0.01
25% 10% 2.50% 15.64 1.07 16.11 1.10 0.13* 15.56 1.10 -0.02
25% 25% 6.25% 15.53 1.07 16.25 1.14 0.13* 15.73 1.16 0.01
25% 50% 12.50% 15.82 1.10 16.41 1.10 0.12* 15.95 1.14 0.00
50% 10% 5.00% 15.83 1.25 17.22 1.20 0.37* 15.77 1.15 -0.01
50% 25% 12.50% 15.65 1.16 17.15 1.22 0.37* 15.72 1.19 0.01
50% 50% 25.00% 15.62 1.03 17.24 1.25 0.38* 15.95 1.19 0.05
Note. Effect sizes (d) were computed by subtracting the "true" mean from the filtered mean. As a result, a positive effect size indicates that the filtering procedure overestimated the "true" mean, that is, artificially inflated the mean after removing unmotivated examinees (examinee-level filtering) or flagged responses (response-level filtering). * p < .05
As an example, when the total amount of
unmotivated simulees was equal to 10%, the filtered mean was .04 SDs (p >
0.05) greater than the true mean; however, when the percentage of unmotivated
simulees increased to 25% and 50%, the true mean was inflated by as much as
0.13 (p < 0.05) and 0.40 (p < 0.05) SDs, respectively (Table 3). These findings
were robust to the percentage of within-simulee random careless responding and
test difficulty for two reasons: (1) simulees were filtered regardless of whether
they rapid guessed on 10% or 50% of the items, and (2) valid responses were
completely removed from unmotivated simulees and as a result, the difference
in simulee ability and test difficulty had no impact. In contrast, the degree of
score inflation for response-level filtering was similar to that observed when
unmotivated simulees were randomly sampled from across the ability distribu-
tion. That is, at most, when applying this type of filtering, the aggregated score
was overestimated by 0.07 SDs when random careless responses comprised 25%
of the total response sample (Figure 3); however, across conditions, there was
no significant bias on the aggregated scores as was seen with examinee-level fil-
tering. As a result, response-level filtering was found to be robust to the condi-
tion of random careless responding being related to ability.
Although this study demonstrated that response-level filtering provided fil-
tered means that were not significantly biased under conditions of motivation
being un/related to ability, it is limited in two ways. First, it makes the assump-
tion that any response not flagged can be used for the accurate estimation of
examinee ability. However, such an assumption is difficult to evaluate, particu-
larly when only response times are available (i.e., other evidence sources of cog-
nition, such as measures of eye-tracking, electroencephalogram [EEG], and
emotional state, are needed). Second, there are testing programs that may rely
solely on the use of classical test theory (CTT) for scoring and therefore cannot
employ response-level filtering, as it requires the use of IRT. As a result, many
practitioners may see the use of examinee-level filtering as the only option for
purifying biased aggregated scores. Consequently, it is vital to assess whether
the assumption underlying this filtering approach is tenable in practice.
STUDY 2
results may largely be due to the statistical artifact of possessing highly negatively
skewed RTE distributions, which artificially attenuate the true correlation coeffi-
cients. As a result, it is argued that the validity of examinee-level filtering is best
examined when comparing performance differences between examinees deemed to
be motivated and unmotivated. To this end, a few studies have examined mean
score differences on prior achievement measures between motivation groups (e.g.,
Allen et al., 2016; Rios et al., 2014; Wise & Kong, 2005). As an example, Wise
and Kong found that examinees who responded carelessly on 20% or more
of items actually scored slightly lower than their engaged counterparts on the SAT
(d = -0.33). Similarly, Allen and colleagues (2016) found that disengaged examin-
ees on a college writing task possessed significantly lower self-reported ACT scores
when compared to engaged examinees (d = -0.52). The results from these two
studies may suggest a relationship between test-taking motivation and ability, which
calls for further examination of the issue, particularly as this relationship impacts the
decision of which filtering procedure to employ.
As a result of the limited research in this area and the fact that we showed
examinee-level filtering to lead to large bias when unmotivated examinees are
predominately of low ability, the objective of this study was to evaluate perfor-
mance differences between motivated and unmotivated examinees on expected
performance for a large-scale low-stakes assessment and prior achievement mea-
sured by the SAT. Such an analysis should provide some evidence regarding the
tenability of the assumption underlying examinee-level filtering in practice.
Method
and response-level filtering. Although the true mean was not known as real data
were evaluated, if the examinee-level filtered mean was higher than that of
response-level filtering, the results from study 1 would suggest that such a trend
may largely be artificial inflation of the mean due to the removal of low ability
examinees. To support the results from the first analysis, motivation groups were
compared in terms of expected group means on the EPP and observed means on
the SAT. However, before conducting either analysis, random careless responses
and, in turn, unmotivated examinees first had to be classified. For the purposes of
study 2, response latency was used as an indicator of random careless responses
because, unlike in study 1, true random careless responses were not known.
Three response time threshold procedures were implemented to classify random
careless responding: the three-second (3SEC), normative threshold (three levels:
15% [NT15], 20% [NT20], 25% [NT25]), and visual inspection (VI) procedures.
Each one of these procedures functions by defining a random careless response as
any response given in less time than that set by the response time threshold, which
differs across procedures. Specifically, a random careless response is defined as any
response given in less than (a) three seconds for the 3SEC procedure and (b) a per-
centage (15%, 20%, or 25%) of the mean item response time for the NT procedure.
For example, if the NT20 procedure (using 20% of the mean item response time as
the threshold) was employed and the mean item response time was 25 seconds, a
random careless response would be defined as a response provided in less than five
seconds (threshold: 25 seconds × 0.20 = 5 seconds). In contrast, the VI proce-
dure defines a response time threshold as the time in which the nonsolution (i.e.,
examinees tend to respond very quickly without processing the item content) and
solution behavior (i.e., examinees try to seek the correct response; Kong et al.,
2007) response time distributions meet. For example, in Figure 4, a random careless response would be defined as a response given in less than seven seconds.

FIGURE 4
An example of a response time frequency distribution with rapid guessing.

FIGURE 5
An example of a response time frequency distribution with no clear pattern of rapid guessing.
However, a clear delineation between the solution and nonsolution behavior
distributions is not always present (see Figure 5 for an example). When this is the case, ran-
dom careless responding cannot be clearly defined and identified, which was the
case for 36 of the 108 items in the present dataset. As a result, these items were
dropped from the analysis for the purpose of comparing the results across the
3SEC, NT, and VI threshold procedures.
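The 3SEC and NT procedures are simple enough to automate. The sketch below assumes a hypothetical examinee-by-item matrix of response times in seconds; the VI procedure is omitted because it requires inspecting each item's response time distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical response-time matrix (seconds): 1,322 examinees by the 72
# items retained for the threshold comparison.
rt = rng.lognormal(mean=3.0, sigma=0.6, size=(1322, 72))

def careless_flags(rt, procedure="NT20"):
    """Flag responses as random careless under the 3SEC or NT procedures;
    True marks a response faster than the item's threshold."""
    if procedure == "3SEC":
        thresholds = np.full(rt.shape[1], 3.0)      # flat three-second rule
    elif procedure.startswith("NT"):
        pct = int(procedure[2:]) / 100.0            # NT15, NT20, or NT25
        thresholds = pct * rt.mean(axis=0)          # % of mean item RT
    else:
        raise ValueError(f"unknown procedure: {procedure}")
    return rt < thresholds

# e.g., a mean item RT of 25 seconds yields a 5-second threshold under NT20
```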
Once random careless responses were flagged, examinees were classified
as unmotivated for each flagging procedure by using response time effort
(RTE; Wise & Kong, 2005), which was calculated as:
$$\mathrm{RTE}_j = 1 - \frac{\sum_{i=1}^{I} SB_{ij}}{I} \qquad (4)$$
where $\mathrm{RTE}_j$ is the response time effort for examinee $j$, $I$ is the total
number of items, and $SB_{ij}$ is the solution behavior indicator for examinee $j$ on
item $i$. For $SB_{ij}$, a value of 1 was given to examinee $j$ on item $i$ for any item
identified as a random careless response, while a value of 0 was given if examinee $j$
spent more time answering item $i$ than the threshold set by the respective flagging
method. Traditionally, data
from examinees are arbitrarily removed if their RTE is less than 0.90; however, to
assess the adequacy of various RTE thresholds for examinee-level filtering, we set
thresholds from 0.50 to 0.90 in increments of 0.10. For brevity's sake, we only
present results for an RTE threshold of 0.20, following the recommendations of
Hauser and Kingsbury (2009).²

² If interested, the reader can contact the first author for full results.
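Continuing the sketch above, RTE (Equation 4) and examinee-level filtering at a given threshold reduce to a few lines; the `scores` vector here is hypothetical.

```python
import numpy as np  # continues the sketch above

def rte(flags):
    """Response time effort (Equation 4): one minus the proportion of an
    examinee's responses flagged as random careless."""
    return 1.0 - flags.mean(axis=1)

flags = careless_flags(rt, procedure="NT20")
keep = rte(flags) >= 0.20   # the RTE threshold for which results are reported

# `scores` is a hypothetical vector of total scores aligned with rt's rows
scores = rng.integers(0, 73, size=rt.shape[0])
filtered_scores = scores[keep]   # examinee-level filtering (listwise deletion)
```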
Results
TABLE 4
Results of Rapid Response Flagging and Examinee-Level Filtering to EPP Data

Columns, in order: flagging procedure; mean and SD of the number of flagged responses per examinee; proportion correct for flagged responses; percentage of careless responses in the total sample; N, mean, SD, and d after examinee-level filtering; and N, mean, SD, and d after response-level filtering.

3SEC 4.61 9.75 .22 6.41 1202 37.48 10.81 .12* 1322 36.26 9.93 .00
NT15 5.64 10.85 .21 7.84 1162 37.80 10.80 .15* 1322 36.31 9.89 .01
NT20 6.81 11.74 .21 9.47 1118 38.24 10.68 .19* 1322 36.37 9.85 .01
NT25 7.82 12.39 .22 10.87 1081 38.57 10.66 .22* 1322 36.41 9.83 .02
VI 7.96 12.04 .22 11.05 1078 38.59 10.66 .22* 1322 36.43 9.82 .02
Note. The expected proportion correct for flagged responses was chance (.25). The percent of total
responses column provides the percentage of careless responses in the total sample based on the
respective flagging procedure. Unfiltered group mean (SD) D 36.17 (11.18). d was based on the dif-
ference between the unfiltered and filtered means. As a result, a positive d indicates that after filtering
the mean was higher than when unfiltered. * p < .05
responses when comparing the most liberal (3SEC) and strict (VI) rules.
Across procedures, the proportion correct for the flagged responses was
close to that expected by chance, which provided some evidence that the
careless responding identified may have been random. However, based on
the findings from study 1, the actual impact of careless responding on
aggregated scores in this sample may not have been significant as (a) this
assessment was of moderate difficulty (the filtered mean item difficulty
ranged from 0.27 to 0.40 [differed by flagging procedure] based on a modi-
fied 3PL model), and (b) the percent of total careless responses in the
sample ranged from as little as 6.41% (3SEC) to 11.05% (VI).
TABLE 5
Differences in Expected EPP and SAT Performance by Motivation Group

Columns, in order: flagging procedure; N, mean, and SD for engaged examinees on the EPP; N, mean, and SD for disengaged examinees on the EPP; d; N, mean, and SD for engaged examinees on the SAT; N, mean, and SD for disengaged examinees on the SAT; and d.

3SEC 1202 37.42 9.59 120 25.49 5.84 1.50* 1047 966.49 143.50 97 921.44 117.10 .34*
NT15 1162 37.75 9.54 160 26.21 5.81 1.46* 1012 967.89 144.25 127 917.08 115.05 .38*
NT20 1118 38.18 9.42 204 26.71 5.93 1.45* 975 971.67 143.98 159 915.03 115.17 .43*
NT25 1081 38.50 9.38 241 27.11 5.84 1.45* 939 974.74 144.65 193 907.09 115.47 .51*
VI 1078 38.53 9.36 244 27.17 5.94 1.44* 939 974.31 144.61 193 908.08 114.10 .50*
Note. d is the effect size when comparing the difference between engaged and disengaged
expected group means. Results are based on a RTE threshold of .20. * p < .05
DISCUSSION
Limitations
In general, the findings from this study are limited to random careless
responding in relation to multiple-choice assessments and cannot be general-
ized to other forms of responding that do not accurately reflect an examinee’s
skill or ability (e.g., content responsive faking, inattentiveness, linguistic
incompetence, and LongString responding [providing the same response con-
secutively across a number of items]) and/or surveys and questionnaires;
however, there has been extensive previous work conducted in these areas
(e.g., Curran, 2015; Huang, Liu, & Bowling, 2015; Meade & Craig, 2011). In
addition to this general limitation, there are study-specific limitations that
should be noted. For study 1, one of the major limitations was regarding how
response-level filtering was employed. Specifically, data were generated with
a 3PL model, but were estimated using a constrained 3PL model in which the
c-parameter was set equal to 0.25 across all items. As a result, the estimated
item parameters (including a and b) were biased due to the different models
used for data generation and estimation, which may have had an impact on
the recovery of the true mean by the response-level filtering procedure. Con-
sequently, this filtering procedure may have been more accurate than demon-
strated. Nevertheless, such item parameter estimation bias is to be expected
when administering multiple-choice items as the c-parameter may be difficult
to estimate (Han, 2012).
For study 2, one limitation was that the results were based on data from a single
institution that administered the EPP, and therefore the findings may not general-
ize to examinees at other institutions or to other types of tests. An additional
limitation is that the expected means were computed based on the overarch-
ing assumption that nonflagged responses were valid indicators of examinees' ability.
Future Research
There are three important areas of future research. The first is related to the
development and/or validation of procedures to identify careless responding.
As noted in the previous sections, there has been previous work related to
identifying methods in both survey and cognitive assessments; however,
when it comes to methods that evaluate response latencies as a proxy of
careless responding, there is a need to use additional cross-validation meth-
ods to ensure that the criteria used for classification are accurate. To this
end, it is suggested that a multimodal approach to validating response
latency approaches be taken by evaluating examinee behavior, such as eye
movement and key-stroke behavior. The latter approach may be particularly
important for identifying careless responding on constructed response items
as there has been limited research on this topic (e.g., Steedle, 2014).
A second important area of future research is related to the development
of new filtering approaches. This study has illustrated that the biasing effect
of careless responding on aggregated scores is largely dependent on the
overall percentage of careless responses in the sample. Yet, the majority of
literature has focused on the degree of within-examinee careless responding,
which is reflected in the development and popularity of the examinee-level
filtering procedure. Although this study has shown the effectiveness of
employing IRT to compute expected scores, what happens if a testing pro-
gram does not employ IRT (i.e., cannot use response-level filtering)? Fur-
thermore, what can be done if the assumption underlying examinee-level
filtering is found to be untenable? In such a case, the only option may be to
allow the random careless responses to remain in the total sample. Conse-
quently, there is a need to develop new filtering procedures that do not rely
on IRT and can limit the biasing seen in examinee-level filtering when abil-
ity is related to random careless responding.
Lastly, greater focus should be placed on approaches to handling careless
responding a priori. There is a rich literature related to the design and
implementation of interventions geared toward improving examinee test-tak-
ing motivation, which spans back to the 1940s (Rothe, 1947). These
Implications
Overall, the results from our studies provide a number of implications for practi-
tioners. First, although the removal of any random careless response is desirable,
it must be considered whether the degree of random careless responses has a
practically significant bias on aggregated scores and warrants the need for filter-
ing. This evaluation should not be based on the number of unmotivated examin-
ees (i.e., examinees with amounts of careless responses deemed to be too large),
but rather on the combination of test difficulty and the total percent of careless
responses in the sample. Second, if the percent of careless responding is found to
be of significant concern, it is recommended that response-level filtering
be applied to improve data quality due to its less restrictive underlying assump-
tions and superior filtering accuracy. If one is unable to employ response-level
filtering due to a lack of knowledge concerning IRT or testing program con-
straints (e.g., a program only uses CTT), examinee-level filtering should be
employed; however, before doing so, the underlying assumption that ability is
unrelated to random careless responding should be evaluated. This recommenda-
tion relates to the fact that when this assumption is untenable, examinee-level fil-
tering can create more bias than leaving the data unfiltered. If this assumption
cannot be tested due to a lack of prior achievement data or the assumption is
found to be untenable, one solution could be to iteratively remove examinees
with the highest percentages of within-examinee careless responding until the
overall percent of careless responses in the sample is within an acceptable range
(refer to the findings from study 1). Such an approach may allow for filtering
careless responses while mitigating artificial inflation from removing low ability
examinees. These recommendations, when applied correctly, should assist prac-
titioners with managing the threat of low test-taking motivation from a post-hoc perspective.
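A rough sketch of the iterative removal approach described above is given below; the 5% target is an illustrative choice, not a value prescribed by the study.

```python
import numpy as np

def iterative_examinee_filter(scores, pct_careless, target=0.05):
    """Drop the examinees with the highest within-examinee careless
    percentages, one at a time, until the overall careless rate in the
    remaining sample (for equal-length tests, the mean within-examinee
    percentage) falls to the target."""
    keep = np.ones(len(scores), dtype=bool)
    for j in np.argsort(pct_careless)[::-1]:   # most careless first
        if pct_careless[keep].mean() <= target:
            break
        keep[j] = False
    return scores[keep]
```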
ACKNOWLEDGMENTS
The authors would like to thank Brent Bridgeman, Guangming Ling, and James
Carlson from the Educational Testing Service for their comments on an earlier draft.
REFERENCES
Allen, L. K., Mills, C., Jacovina, M. E., Crossley, S., D’Mello, S., & McNamara, D. S. (2016, April).
Investigating boredom and engagement during writing using multiple sources of information: The
essay, the writer, and keystrokes. Paper presented at the 6th International Learning Analytics and
Knowledge Conference, Edinburgh, Scotland.
American Educational Research Association. (2000). Position statement of the American Educational
Research Association concerning high-stakes testing in PreK–12 education. Educational
Researcher, 29, 24–25.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (2014). Standards for educational and psychological test-
ing (6th ed.). Washington, DC: American Educational Research Association.
Baumert, J., & Demmrich, A. (2001). Test motivation in the assessment of student skills: The effects of
incentives on motivation and performance. European Journal of Psychology of Education, 16, 441–462.
Braun, H., Kirsch, I., & Yamamoto, K. (2011). An experimental study of the effects of monetary
incentives on performance on the 12th-grade NAEP reading assessment. Teachers College Record,
11, 2309–2344.
Buchholz, J., & Rios, J. A. (2014, July). Examining response time threshold procedures for the identi-
fication of rapid-guessing behavior in small samples. Paper presented at the 9th Conference of the
International Test Commission, San Sebastian, Spain.
Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors.
Psychometrika, 73, 209–230.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Covington, M. V., & Beery, R. (1976). Self-worth and school learning. New York: Holt, Rinehart, &
Winston.
Curran, P. G. (2015). Methods for the detection of carelessly invalid responses in survey data. Jour-
nal of Experimental Social Psychology. Advance online publication. http://dx.doi.org/10.1016/j.
jesp.2015.07.006
De Ayala, R. J., Plake, B., & Impara, J. C. (2001). The effect of omitted responses on the accuracy of
ability estimation in item response theory. Journal of Educational Measurement, 38, 213–234.
Debeer, D., Buchholz, J., Hartig, J., & Janssen, R. (2014). Student, school, and country differences in
sustained test-taking effort in the 2009 PISA reading assessment. Journal of Educational and
Behavioral Statistics, 39, 502–523.
DeMars, C. E., & Wise, S. L. (2010). Can differential rapid-guessing behavior lead to differential
item functioning? International Journal of Testing, 10, 207–229.
Goegebeur, Y., De Boeck, P., Wollack, J. A., & Cohen, A. S. (2008). A speeded item response model
with gradual process change. Psychometrika, 73, 65–87.
Goldhammer, F., Martens, T., Christoph, G., & Lüdtke, O. (2016). Test-taking engagement in PIAAC
(OECD Education Working Papers, No. 133). Paris, France: OECD Publishing.
Guo, H., Rios, J. A., Haberman, S., Liu, O. L., Wang, J., & Paek, I. (2016). A new procedure for
detection of students’ rapid guessing responses using response time. Applied Measurement in Edu-
cation. Advance online publication. doi:10.1080/08957347.2016.1171766
Han, K. T. (2012). Fixing the c parameter in the three-parameter logistic model. Practical Assess-
ment, Research & Evaluation, 17, 1–24.
Hauser, C., & Kingsbury, G. G. (2009, April). Individual score validity in a modest-stakes adaptive
educational testing setting. Paper presented at the meeting of the National Council on Measure-
ment in Education, San Diego, CA.
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2011). Detecting and deterring
insufficient effort responding to surveys. Journal of Business and Psychology, 27, 99–114.
Huang, J. L., Liu, M., & Bowling, N. A. (2015). Insufficient effort responding: Examining an insidi-
ous confound in survey data. Journal of Applied Psychology, 3, 828–845.
Jagacinski, C. M., & Nicholls, J. G. (1990). Reducing effort to protect perceived ability: “They’d do
it, but I wouldn’t.” Journal of Educational Psychology, 82, 15–21.
Jin, K., & Wang, W. (2014). Item response theory models for performance decline during testing.
Journal of Educational Measurement, 51, 178–200.
Kong, X. J. (2007). Using response time and the effort-moderated model to investigate the effects of
rapid guessing on estimation of item and person parameters (Unpublished doctoral dissertation).
Harrisonburg, VA: James Madison University.
Kong, X. J., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to dif-
ferentiate solution behavior from rapid-guessing behavior. Educational and Psychological Mea-
surement, 67, 606–619.
Lee, Y.-H., & Jia, Y. (2014). Using response time to investigate students’ test-taking behaviors in a
NAEP computer-based study. Large-scale Assessments in Education, 2, 1–24.
Ling, G., Attali, Y., Finn, B., & Stone, E. (2016). Is a computerized adaptive test more motivating
than a fixed item test? Manuscript submitted for publication.
Liu, O. L. (2008). Measuring learning outcomes in higher education using the Measure of Academic
Proficiency and Progress (MAPP) (ETS RR-08-47). Princeton, NJ: Educational Testing Service.
Liu, O. L., Rios, J. A., & Borden, V. (2015). Does motivational instruction affect college students’ per-
formance on low-stakes assessment? An experimental study. Educational Assessment, 20, 79–94.
Meade, A. W., & Craig, S. B. (2011, April). Identifying random careless responses in survey data.
Paper presented at the 26th annual meeting of the Society for Industrial and Organizational Psy-
chology, Chicago, IL.
Ortner, T. M., Weißkopf, E., & Koch, T. (2014). I will probably fail: Higher ability students' motiva-
tional experiences during adaptive achievement testing. European Journal of Psychological
Assessment, 30, 48–56.
Osborne, J. W., & Blanchard, M. R. (2011). Random responding from participants is a threat to the
validity of social science research results. Frontiers in Psychology, 1, 1–7.
Penk, C., & Schipolowski, S. (2015). Is it all about value? Bringing back the expectancy component
to the assessment of test-taking motivation. Learning and Individual Differences, 42, 27–35.
Rios, J. A., Liu, O. L., & Bridgeman, B. (2014). Identifying unmotivated examinees on student learn-
ing outcomes assessment: A comparison of two approaches. New Directions for Institutional
Research, 161, 69–82.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory
analyses. Journal of Statistical Software, 17, 1–25.
Rothe, H. F. (1947). Distribution of test scores in industrial employees and applicants. Journal of
Applied Psychology, 31, 480–483.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model:
A new method of measuring speededness. Journal of Educational Measurement, 34, 213–232.