International Journal of Testing, 0: 1–31, 2016
Copyright © Educational Testing Service
ISSN: 1530-5058 print / 1532-7574 online
DOI: 10.1080/15305058.2016.1231193

Evaluating the Impact of Careless Responding on Aggregated-Scores:
To Filter Unmotivated Examinees or Not?

Joseph A. Rios, Hongwen Guo, Liyang Mao, and Ou Lydia Liu
Educational Testing Service, USA

Author note: Correspondence should be sent to Joseph Rios, Educational Testing Service,
660 Rosedale Road, Princeton, NJ 08540. E-mail: jrios@ets.org. Liyang Mao is now
affiliated with IXL Learning.

When examinees’ test-taking motivation is questionable, practitioners must
determine whether careless responding is of practical concern and if so, decide on
the best approach to filter such responses. As there has been insufficient research
on these topics, the objectives of this study were to: a) evaluate the degree of
underestimation in the true mean when careless responses are present, and b)
compare the effectiveness of two filtering procedures in purifying biased
aggregated-scores. Results demonstrated that: a) the true mean was underestimated
by around 0.20 SDs if the total amount of careless responses exceeded 6.25%,
12.5%, and 12.5% for easy, moderately difficult, and difficult tests, respectively,
and b) listwise deleting data from unmotivated examinees artificially inflated the
true mean by as much as .42 SDs when ability was related to careless responding.
Findings from this study have implications for when and how practitioners should
handle careless responses for group-based low-stakes assessments.

Keywords: test taking, motivation, careless responding, rapid guessing, low-stakes testing, validity

Test-taking motivation, which is defined as the willingness to engage as
well as invest effort and persistence in working on test items (Baumert &
Demmrich, 2001), has been a growing concern as low-stakes assessments
have been increasingly used for purposes such as international comparative
studies and accountability or accreditation in higher education. In such con-
texts, low test-taking motivation may serve as an extraneous factor to exam-
inees accurately demonstrating the skills, abilities, or proficiencies being
assessed, which threatens the validity of test-score interpretations. Although
low-stakes assessments typically have no consequences for examinees,
results that do not accurately reflect examinees’ ability may provide mis-
leading information for important stakeholders such as policy makers and
institutional administrators (American Educational Research Association
[AERA], 2000). As a result, it is important to examine whether examinees
are sufficiently serious when taking the test (AERA, American Psychologi-
cal Association, & National Council on Measurement in Education, 2014).
To this end, the focus of this article is on random careless responding,1
which we define as nonsystematic responding with intentional disregard for
item content due to low test-taking motivation. The following sections pro-
vide a summary of previous research that has evaluated the impact of ran-
dom careless responding on test results as well as methods for both defining
and filtering random careless responding via response latencies (Wise &
Kong, 2005).

The Impact of Low Examinee Motivation on Test Results


Numerous studies have illustrated that low examinee motivation puts into ques-
tion the validity of score-based inferences as scores from such contexts may
not be reflective of the underlying skills or abilities of the examinee (e.g.,
Wise & DeMars, 2005). As an example, random careless responding has been
found to lead to biased individual-level ability estimates (e.g., De Ayala, Plake,
& Impara, 2001) and other measurement properties, such as item parameter
estimates (e.g., DeMars & Wise, 2010; Wise & DeMars, 2006), reliability esti-
mates (e.g., Sundre & Wise, 2003; Wise & DeMars, 2009; Wise & Kong,
2005), and correlations with external variables (e.g., Wise, 2009). However, as
low stakes tests tend to be administered for the purpose of evaluating the
impact of educational programs at the group-level (e.g., Programme for Inter-
national Student Assessment, Trends in International Mathematics and Science,
Assessment of Higher Education Learning Outcomes; Wise, 2015), random
careless responding may potentially impact score-based inferences via aggre-
gated scores in terms of treatment effects (Osborne & Blanchard, 2011),
achievement gains (Wise & DeMars, 2010), evaluation ratings of school per-
sonnel when assessing student growth (Wise, Ma, Cronin, & Theaker, 2013),
and country-level comparisons (Debeer, Buchholz, Hartig, & Janssen, 2014;
Goldhammer, Martens, Christoph, & Lüdtke, 2016). However, this research
area is incomplete as these studies have been restricted to applied data analyses
and, as a result, the findings may have been confounded by the procedure for
identifying and removing random careless responses. Consequently, the true
degree of bias on aggregated scores is still largely unknown. Clearly, all careless
responding is undesirable; however, better recommendations are needed for
practitioners to guide them as to when it is necessary to reduce the bias of such
responses. In the next section, we describe approaches to dealing with this issue.

1. A number of different terms have been used throughout the literature to describe unmotivated
examinee responding; however, we chose the term random careless responding due to its continuous
popularity in the literature. For a brief review of terminologies, the reader is referred to Huang,
Curran, Keeney, Poposki, and DeShon (2011).

Procedures for Filtering Random Careless Responses


One strategy that has been applied to address the threat of low test-taking
motivation is to utilize response latency as an indicator of random careless
responding to (a) flag and (b) filter (i.e., remove) invalid responses. The
assumption of this approach is that responses are invalid if given in an
amount of time that does not allow the examinee to fully read and compre-
hend the item stem and response options (commonly referred to as rapid
guessing or rapid responding; Schnipke & Scrams, 1997). Responses
detected using this approach have been shown, on average, to possess a pro-
portion correct around chance (Wise & Kong, 2005), which would support
the idea that examinees providing rapid responses are doing so in a manner
with intentional disregard for item content.
A number of different approaches have been developed for flagging ran-
dom careless responses via response latency: (a) a common criterion across
all items (Wise, Kingsbury, Thomason, & Kong, 2004); (b) item surface
information (e.g., combination of the number of characters in an item as
well as whether ancillary information was provided [Wise & Kong, 2005];
the number of words expected to be read per minute by examinees [Buch-
holz & Rios, 2014]); (c) visual inspection of response time frequency distri-
butions (Wise, 2006); (d) statistical estimation using a two-state mixture
model (Kong, Wise, & Bhola, 2007); (e) a percentage of the average item
response time (Wise & Ma, 2012); and (f) inspection of both response time
and response accuracy distributions (Guo et al., 2016; Lee & Jia, 2014).
Although a number of methods have been developed to flag random careless
responses when using response latency, previous research has demonstrated
practically negligible differences in terms of mean score differences and
convergent (i.e., correlations between mean test scores after removing rapid
guesses and measures of test performance as well as self-reported effort)
and discriminant (i.e., correlations between mean test scores after removing
rapid guesses and prior achievement) validity coefficients (Buchholz &
Rios, 2014; Kong et al., 2007). Regardless of the procedure employed, the
practitioner is next confronted with how to purify scores deemed to be a
random careless response. Two of the most popular approaches are exam-
inee- and response-level filtering.

Examinee-Level Filtering. Examinee-level filtering consists of listwise
deleting an examinee’s data if the number of random careless responses for
that examinee exceeds an a priori cut-off. Although Hauser and Kingsbury
(2009) found that proficiency estimates and their accompanying standard
errors were not adversely impacted if the percentage of random careless
responses did not exceed 20%, examinees have been traditionally classified
as unmotivated if the number of random careless responses is greater than
10% of the items on the test of interest (e.g., Wise & Kong, 2005). In gen-
eral, numerous studies have applied examinee-level filtering and have found
that both mean scores and estimates of convergent validity improved (Kong
et al., 2007; Sundre & Wise, 2003; Wise & DeMars, 2005; Wise, Kings-
bury, Thomason, & Kong, 2004; Wise & Kong, 2005; Wise, Wise, & Bhola,
2006). Although this filtering procedure has been found to produce favor-
able results, it makes the assumption that random careless responding is
unrelated to true ability. If this assumption is untenable, the removal of
examinees will result in either an inflation or underestimation of the true
mean depending on whether filtered examinees are predominately of low or
high ability.
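The cut-off rule described above is straightforward to operationalize. The following base R sketch is illustrative only (it is not the authors' code), and the objects responses and flags are hypothetical: it listwise deletes any examinee whose proportion of flagged responses exceeds 10%.

```r
# Illustrative sketch of examinee-level filtering (hypothetical objects, not the authors' code).
# responses: examinees x items matrix of scored responses (0/1)
# flags:     examinees x items logical matrix, TRUE = response flagged as random careless
examinee_level_filter <- function(responses, flags, cutoff = 0.10) {
  prop_flagged <- rowMeans(flags)        # proportion of flagged responses per examinee
  keep <- prop_flagged <= cutoff         # retain examinees at or below the cut-off
  responses[keep, , drop = FALSE]        # listwise delete everyone else
}
```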

Response-Level Filtering. One alternative to examinee-level filtering is
response-level filtering. That is, instead of listwise deleting data from an
examinee deemed to be unmotivated, filtering occurs for individual
responses classified as random careless responses. Such an approach is
advantageous in that valid responses are not discarded, particularly as previous
research has shown that most examinees provide noncareless responses early in
the test (Wise & Kong, 2005). Further, application of
examinee-level filtering can result in discarding up to 25% of the sample
data (e.g., Rios, Liu, & Bridgeman, 2014), which prevents the inclusion of
removed examinees for any future analyses (e.g., longitudinal tracking).
Although response-level filtering provides practical advantages (i.e., retain-
ing as much data as possible), it has not been recommended in practice for
two reasons: (1) it could lead to the comparison of examinees on a different
number of valid items if computing raw scores (Wise, 2009), and (2) it has
not been found to drastically improve convergent validity coefficients
between filtered mean test scores (i.e., scores after removing random care-
less responses) and external measures of prior achievement (e.g., GPA,
SAT) (Kong, 2007; Wise, 2006). To overcome the former concern, this arti-
cle introduces an Item Response Theory (IRT) approach to employing
response-level filtering that allows for both the avoidance of filtering valid
data (as is done with examinee-level filtering) and the comparison of exam-
inees across all items.

Study Objectives
As previous research has been limited in providing practical recommendations to
practitioners on when and how random careless responding should be dealt with,
the objectives of this study were to evaluate (a) the degree of random careless
responding that would have a practically significant impact on aggregated
scores, (b) the effectiveness of purifying biased aggregated scores for two filter-
ing procedures, examinee- and response-level filtering, and (c) the tenability of
the assumption that random careless responding is unrelated to ability, which
underlies the common practice of listwise deleting examinees deemed to be care-
less responders or unmotivated (i.e., examinee-level filtering). These study
objectives were addressed via two studies. In study 1 we investigated the follow-
ing research questions through data simulations:

1. What is the proportion of random careless responses that would lead to a
practically significant underestimation of aggregated scores?

Hypothesis: Underestimation is most impacted by two factors: (a) the total percent-
age of careless responses in the sample and (b) test difficulty. Specifically, it is
hypothesized that as the percentage of careless responses in the sample increases,
underestimation will also increase, but at a differential rate based on test difficulty.
That is, as the true probability of success for many examinees is high when a test is
easy, careless responding will lead to increased underestimation, but as the true
probability of success decreases (as is the case for more difficult tests) so does the
impact of careless responding on aggregated-scores.

2. Which filtering procedure, examinee- or response-level filtering, is most
effective in purifying biased aggregated scores?

Hypothesis: Both procedures will perform equally well when careless responding is
unrelated to ability; however, it is expected that examinee-level filtering will lead
to less accurate filtering when careless responding is related to ability. The reason
for the latter hypothesis is that when ability is related to careless responding, list-
wise deleting data from examinees deemed to be unmotivated (examinee-level fil-
tering) will alter the true proficiency distribution and result in an inaccurate
observed group score (Wise, Kingsbury, Thomason, & Kong, 2004). Response-
level filtering will be less susceptible to this issue as noncareless response data for
all examinees is included to compute expected scores. As long as the noncareless
responses serve as valid indicators of examinees’ true ability, the true proficiency
distribution should be minimally altered, and as a result, should lead to an accurate
observed group score.

To investigate the tenability of employing examinee-level filtering in practice,
study 2 employed empirical data to examine the following research question:

In practice, can it be assumed that random careless responding is unrelated to ability,
which would allow for the purification of biased aggregated scores by removing
examinees deemed to be careless or unmotivated?

Results from these studies have the potential to inform practitioners of the condi-
tions in which random careless responding is a major concern for group-level
score use and which approach is valid and effective for purifying biased scores.

STUDY 1

Method
To investigate the objectives of study 1, simulated data were analyzed. Next is a
description of how the data were generated, the independent variables that were
manipulated, and the analysis procedures that were implemented.

Data Generation. As previous research has demonstrated differing levels
of random careless responding among examinees (Wise & Kong, 2005), data
were generated for a 30-item multiple choice test separately for two groups: (1)
motivated and (2) unmotivated simulees (simulated examinees). For simplicity’s
sake, in this study simulees providing no random careless responses were labeled
as motivated, while those providing differing levels greater than zero were cate-
gorized as unmotivated.
Data generation for motivated simulees. Item responses were generated
based on the unidimensional three-parameter logistic (3PL) model:

P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\{-1.7\, a_i(\theta - b_i)\}},     (1)

where P_i(θ) indicates the probability of getting item i correct; a_i is the item dis-
crimination parameter for item i; b_i is the item difficulty parameter for item i;
and c_i is the pseudo-guessing parameter for item i. Simulee ability parameters
(θ) were randomly sampled from N(0, 1), while generating item parameters were
sampled from the following distributions:

a_i \sim \text{runif}(0.5, 1.5)
b_i \sim N(\bar{b}, 1)                                        (2)
c_i \sim \text{runif}(0.1, 0.25),

where the mean difficulty \bar{b} varied across conditions. The probability for simulee j,
item i was then
compared to a random number taken from runif (0, 1). A correct response was
given if the probability was larger than the random number, otherwise an incor-
rect response was given.
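As a concrete illustration of this generating process, the base R sketch below draws item and person parameters from the stated distributions and produces scored responses for motivated simulees. It is a simplified re-expression of the description above, not the authors' code, and the mean difficulty value shown is just one of the three levels manipulated later.

```r
# Illustrative sketch of the generating model for motivated simulees (Equations 1 and 2).
set.seed(1)
n_items   <- 30
n_simulee <- 500
b_bar     <- 0                                  # mean difficulty: -1 (easy), 0 (moderate), 1 (difficult)

a     <- runif(n_items, 0.5, 1.5)               # discrimination parameters
b     <- rnorm(n_items, mean = b_bar, sd = 1)   # difficulty parameters
c_par <- runif(n_items, 0.1, 0.25)              # pseudo-guessing parameters
theta <- rnorm(n_simulee, 0, 1)                 # simulee abilities

# 3PL probability of a correct response (Equation 1); rows = simulees, columns = items
p <- t(sapply(theta, function(th) c_par + (1 - c_par) / (1 + exp(-1.7 * a * (th - b)))))

# Compare each probability with a uniform draw to obtain a scored (0/1) response
responses <- (p > matrix(runif(n_simulee * n_items), n_simulee, n_items)) * 1L
```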
Data generation for unmotivated simulees. The data for unmotivated
simulees were generated very similarly to the approach described; however, an
additional step was added. Specifically, after generating item responses based on
the 3PL model, various amounts of these responses were randomly replaced by
random careless responses for unmotivated simulees (refer to Wise & DeMars,
2006, for a similar approach). Within this study, random careless responses were
conceptualized as possessing a correct item response probability equal to
chance; P_i(θ) = 0.25 (assuming four response options for each generated item).
The decision to implement random careless responses with P_i(θ) = 0.25
instead of generating these responses based on the assumptions of the absorbing
state (e.g., Jin & Wang, 2014), decreasing effort (e.g., Goegebeur, De Boeck,
Wollack, & Cohen, 2008) or difficulty-based (e.g., Cao & Stokes, 2008) models
was based on the findings of Wise and Kingsbury (2016). In their study, the fit
of these three models was compared to the effort-moderated model (Wise &
DeMars, 2006) for data obtained from a low-stakes formative assessment admin-
istered to 285,230 examinees.
Three main conclusions can be drawn from the Wise and Kingsbury (2016)
study. First, examinee behavior does not correspond to the assumption that ran-
dom careless responding only occurs when an examinee receives a condition-
ally difficult item (assumed by the difficulty-based model). This finding was
supported by previous research from Wise (2006), who found that item diffi-
culty was nonsignificantly related to examinee effort. Second, patterns from
the data did not demonstrate that random careless responding occurs: (a) gradu-
ally across the test (assumed by the decreasing effort model) or (b) due to a
switch from a motivated to unmotivated state (assumed by the absorbing state
model). Third, assuming no clear pattern of random careless responding may
be most advantageous as such behavior appears to occur idiosyncratically
(assumed by the effort-moderated model). As a result, random careless
responses were generated with a chance probability. Once the replacement pro-
cess was completed, data for the motivated and unmotivated simulees were
combined for analysis.
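The replacement step can be sketched as follows. This is an illustration of the description above rather than the authors' code, and it builds on the hypothetical responses matrix from the previous sketch.

```r
# Illustrative sketch of injecting random careless responses for unmotivated simulees.
inject_careless <- function(responses, unmotivated_idx, prop_careless = 0.25, p_chance = 0.25) {
  for (j in unmotivated_idx) {
    items <- sample(ncol(responses), size = round(prop_careless * ncol(responses)))
    # Replace the selected responses with chance-level (P = .25) random responses
    responses[j, items] <- rbinom(length(items), size = 1, prob = p_chance)
  }
  responses
}

# e.g., 25% of simulees are unmotivated and each is careless on 25% of the items
unmotivated   <- sample(nrow(responses), size = round(0.25 * nrow(responses)))
careless_data <- inject_careless(responses, unmotivated, prop_careless = 0.25)
```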

Independent Variables. Four independent variables were manipulated in
this study: (1) the percentage of random careless responses within an unmoti-
vated simulee, (2) the percentage of unmotivated simulees in the total sam-
ple, (3) whether random careless responding was related to ability, and (4)
test difficulty. Clearly, in practice, the percentage of random careless
responses within unmotivated examinees will differ (based on empirical oper-
ational data, one commonly sees a large variation ranging from 0% to 100%);
however, for this study, simulees were equally constrained to possess three
levels: 10%, 25%, and 50%. In addition to within-simulee levels of random
careless responding, different percentages of unmotivated simulees were
manipulated: 10%, 25%, and 50%. In combining the different levels for these
two independent variables, we produced overall percentages of random care-
less responses that have been seen in operational and previous simulation
work (Wise & DeMars, 2006), which across conditions were: 1%, 2.5%, 5%,
6.25%, 12.5%, and 25%.
Additionally, we also manipulated whether random careless responses were
related to ability or not. Specifically, if it was assumed that random careless
responding was related to low ability, 100% of unmotivated simulees possessed
generating thetas less than zero (using stratified random sampling), regardless of
test difficulty. This was done to evaluate the most extreme bias on aggregated
scores as it is assumed by examinee-level filtering that careless responding is
unrelated to ability. In contrast, if random careless responding was assumed to
be unrelated to ability, unmotivated simulees were randomly sampled across the
ability distribution.
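The two sampling conditions can be sketched as below. This is a simplified illustration of the selection rule (the study used stratified random sampling for the ability-related condition); theta and n_simulee refer to the hypothetical objects from the earlier generation sketch.

```r
# Illustrative sketch of selecting unmotivated simulees under the two conditions.
n_unmot <- round(0.25 * n_simulee)            # e.g., 25% of the sample is unmotivated

# Careless responding unrelated to ability: sample from the full ability distribution
unmot_unrelated <- sample(seq_len(n_simulee), n_unmot)

# Careless responding related to ability: sample only simulees with generating theta < 0
# (assumes enough low-ability simulees are available for the chosen percentage)
low_ability   <- which(theta < 0)
unmot_related <- sample(low_ability, n_unmot)
```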
The sampling strategy implemented to reflect the condition of motivation
being related to ability was employed based on the hypothesis that low ability
examinees may be more prone to random careless responding as they possess
low academic self-esteem (Thompson, Davidson, & Garber, 1995), which
according to the self-worth theory of achievement motivation (e.g., Coving-
ton & Beery, 1976), is strongly related to self-protective behavior (i.e., pro-
viding an external reason for failure to protect one’s global self-esteem;
Thompson et al., 1995). In the testing context, self-protective behavior may
be exhibited as examinees either placing less value on the assessment or
directly employing less effort so that one can attribute possible negative self-
perceptions or feedback of ability to external reasons unrelated to actual abil-
ity (Jagacinski & Nicholls, 1990). Recent work supports this theory as test-
taking effort has been suggested to be best predicted by subjective value as
well as expectations for success (Penk & Schipolowski, 2015). Clearly, there
may be multiple reasons (e.g., examinee characteristics and the testing con-
text) for low test-taking motivation within an examinee or across examinees;
however, for the purposes of this study, low examinee test-taking motivation
was assumed to be due to self-protective behavior based on a low expectation
of success.
The final independent variable evaluated was test difficulty, which was manip-
ulated by varying the mean item difficulty from the generating item parameter dis-
tribution to three levels: −1, 0, and 1. These three levels allowed us to
investigate the degree of underestimation by random careless responding for three
types of tests: easy, N(−1, 1), moderately difficult, N(0, 1), and difficult, N(1, 1).
To summarize, the following independent variables and levels were
examined:

• Percentage of within-simulee random careless responses: 10%, 25%, and 50%
• Proportion of unmotivated simulees: 10%, 25%, and 50%
• Relation of ability and random careless responding: related and unrelated
• Test difficulty: easy, moderately difficult, and difficult

These four independent variables and their respective levels were fully crossed,
which resulted in a 3 × 3 × 2 × 3 design for a total of 54 conditions. Across con-
ditions, the number of simulees was constrained to 500, and to minimize sam-
pling error, each condition was replicated 100 times. As a result, the total
number of simulees within each condition was 50,000.
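The fully crossed design can be enumerated directly, as in the base R sketch below (illustrative only):

```r
# The 3 x 3 x 2 x 3 simulation design: 54 conditions, each replicated 100 times with 500 simulees.
design <- expand.grid(
  within_pct = c(0.10, 0.25, 0.50),   # within-simulee percentage of random careless responses
  unmot_pct  = c(0.10, 0.25, 0.50),   # proportion of unmotivated simulees
  related    = c(FALSE, TRUE),        # careless responding related to ability?
  difficulty = c(-1, 0, 1)            # mean item difficulty (easy, moderate, difficult)
)
nrow(design)  # 54
```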

Analyses. The analytic procedures for the two research questions in this
study differed slightly and as a result, are described separately below.
Impact of random careless responding on aggregated scores. The analy-
ses for this part of the study were straightforward. To evaluate the
degree of bias from random careless responding on aggregated scores, we
compared the observed average total score (including random careless
responses) with the true mean, which was based on the generating data for
all simulees without random careless responses. Comparisons were made by
conducting independent t-tests and calculating Cohen’s d effect sizes.
Based on Cohen’s (1988) recommendations, a d greater than 0.20 was clas-
sified as a nonnegligible difference.
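The comparison amounts to an independent-samples t-test and a standardized mean difference, as sketched below; observed_scores and true_scores are hypothetical vectors of simulee total scores, and the pooled-SD form of Cohen's d is assumed.

```r
# Illustrative sketch of the observed-versus-"true" comparison (not the authors' code).
cohens_d <- function(x, y) {
  sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
               (length(x) + length(y) - 2))   # pooled standard deviation
  (mean(x) - mean(y)) / sp
}

t.test(observed_scores, true_scores)          # independent-samples t-test
cohens_d(observed_scores, true_scores)        # |d| > 0.20 treated as nonnegligible
```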
Comparability of filtering procedures in purifying biased aggregated
scores. To compare the recovery of the true mean by examinee- and
response-level filtering, we first filtered examinees and then compared
observed (based on filtering) and true means as described. Examinee-level fil-
tering was implemented by listwise deleting data for any unmotivated simu-
lees. The employment of response-level filtering was a bit more involved as
IRT expected scores were computed based on a three-stage process. First,
random careless responses were recoded as blank. Such a decision was based
on the idea that a careless response is not a valid indicator of an examinee’s
ability and as such, any response option provided should not be scored as is
nor should the response automatically be scored as incorrect. Secondly, item
and person parameters for the recoded data were estimated based on the mod-
ified-3PL model using marginal Maximum Likelihood estimation in the R ltm
package (Rizopoulos, 2006; step 2). The modified-3PL model differed from
the generating 3PL model in that the c-parameter was constrained to 0.25
across all items for the purpose of gaining stable parameter estimates, which
has been recommended by Han (2012). As a result, realistic error was built
into the estimates by using a generating 3PL model and estimating data with
the modified-3PL model. The third step was to place the estimated person
and item parameters into the 3PL model to obtain an expected group mean
across all items as follows:
" #
X J X I
1 ¡ c
^ i D 1 j ^u j ; a^i ; b^i ; ci / D ci C
P.x
i
; (3)
jD1 iD1 1 C exp[ ¡ a^i .^u j ¡ b^i /]

where J D the total number of simulees or examinees, I D the total number


^ i D 1 j ^u j ; a^i ; b^i ; ci /D the estimated probability of correctly answer-
of items, P.x
ing item i based on estimated person and item parameters from step 2, ^u j D the
estimated proficiency for examinee j from step 2, a^i D the estimated discrimi-
nation parameter for item i from step 2, b^i D the estimated difficulty parameter
for item i from step 2, ci D 0.25. The probabilities for examinee j across I items
was summed to obtain an expected raw score and an expected group mean was
obtained by summing the raw scores across J examinees, which ranged from
0–30 points. This latter score is what one would expect the sample to obtain
had all examinees not careless responded on any of the items. It should be
noted that the computation of expected scores assumes that nonflagged
responses are valid indicators of ability.
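A minimal sketch of the expected-score computation in Equation (3) is given below. It takes the step 2 parameter estimates as given (in the study these came from the modified 3PL fitted with the R ltm package) and fixes the guessing parameter at 0.25; the object names are hypothetical.

```r
# Illustrative sketch of the response-level filtering expected-score step (Equation 3).
# theta_hat: estimated abilities (length J); a_hat, b_hat: estimated item parameters (length I).
expected_group_mean <- function(theta_hat, a_hat, b_hat, c_fix = 0.25) {
  p_hat <- sapply(theta_hat, function(th)
    c_fix + (1 - c_fix) / (1 + exp(-a_hat * (th - b_hat))))   # I x J matrix of probabilities
  expected_raw <- colSums(p_hat)   # expected raw score per examinee (0-30 for a 30-item test)
  mean(expected_raw)               # expected group mean
}
```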

Results

Impact of Random Careless Responding on Aggregated Scores.


Table 1 presents a number of interesting results. For one, regardless of the
within-simulee proportion of random careless responses, test difficulty, or the
relationship between random careless responding and ability, when the percent-
age of unmotivated simulees was equal to 10%, there was not a practically sig-
nificant underestimation of the true mean. The reason for this finding is that
even when simulees carelessly responded on 50% of the items, the total amount
of random careless responses in the sample was only equal to 5%. As a result,
there was little impact on the true mean. A second trend observed was that the
percentage of random careless responses in the sample had a differential biasing
effect on aggregated scores based on test difficulty (Figure 1). Specifically,
fewer random careless responses in the sample were needed to significantly bias
scores as the test became easier. For example, for both conditions in which moti-
vation was related and unrelated to ability, the true mean was significantly
underestimated (d > 0.20) if the total amount of random careless responses
exceeded 6.25%, 12.5%, and 12.5% for easy (ability unrelated: d = −0.24, ability
related: d = −0.20), moderately difficult (ability unrelated: d = −0.34, ability
related: d = −0.28), and difficult tests (ability unrelated: d = −0.27, ability
related: d = −0.22), respectively. The last trend observed was that
biasing effects were very similar when random careless responding was and was

TABLE 1
Impact of Random Careless Responding on Aggregated-Scores

Columns: Test Difficulty | % Unmotivated Examinees | % Within-Examinee Random Careless Responses |
Total Amount of Random Careless Responses in Sample | “True” Mean | “True” SD |
Careless Responding Unrelated to Ability (Observed Mean, Observed SD, d) |
Careless Responding Related to Ability (Observed Mean, Observed SD, d)

Easy 10% 10% 1.00% 21.82 1.06 21.67 1.06 -0.04 21.58 1.33 -0.03
(b D -1.0) 10% 25% 2.50% 21.86 1.17 21.53 1.16 -0.09 21.56 1.10 -0.08
10% 50% 5.00% 21.62 1.28 20.91 1.21 -0.18* 20.97 1.02 -0.16*
25% 10% 2.50% 21.87 1.17 21.53 1.14 -0.10 21.50 1.11 -0.09
25% 25% 6.25% 21.95 1.18 21.10 1.11 -0.24* 21.22 1.12 -0.20*
25% 50% 12.50% 21.85 1.11 20.07 0.98 -0.43* 20.09 1.07 -0.37*
50% 10% 5.00% 21.75 1.10 21.04 1.04 -0.20* 21.05 1.00 -0.17*
50% 25% 12.50% 21.95 1.28 20.27 1.15 -0.46* 20.37 0.95 -0.39*
50% 50% 25.00% 21.76 1.08 18.17 0.82 -0.85* 18.56 0.85 -0.69*
Moderate 10% 10% 1.00% 18.76 1.20 18.66 1.18 -0.03 18.44 1.13 -0.03
Difficulty 10% 25% 2.50% 18.90 1.10 18.63 1.07 -0.07 18.66 1.10 -0.06
(b D 0) 10% 50% 5.00% 18.83 1.10 18.26 1.04 -0.14* 18.27 1.21 -0.12*
25% 10% 2.50% 18.78 1.12 18.50 1.10 -0.07 18.36 1.20 -0.06
25% 25% 6.25% 18.55 1.19 17.90 1.14 -0.17* 18.05 1.19 -0.14*
25% 50% 12.50% 18.85 1.16 17.43 1.02 -0.34* 17.55 0.97 -0.28*
50% 10% 5.00% 18.74 1.11 18.19 1.07 -0.14* 18.08 1.14 -0.12*
50% 25% 12.50% 18.75 1.27 17.44 1.12 -0.34* 17.77 1.01 -0.29*
50% 50% 25.00% 18.72 1.20 15.92 0.91 -0.68* 16.45 0.84 -0.55*

Difficult 10% 10% 1.00% 15.71 1.12 15.63 1.11 -0.02 15.62 1.19 -0.02
(b D 1.0) 10% 25% 2.50% 15.84 1.16 15.65 1.14 -0.05 15.37 1.15 -0.04
10% 50% 5.00% 15.73 1.20 15.31 1.14 -0.11 15.55 1.08 -0.09
25% 10% 2.50% 15.64 1.07 15.45 1.06 -0.05 15.45 1.03 -0.05
25% 25% 6.25% 15.53 1.07 15.06 1.00 -0.13 15.36 1.05 -0.11
25% 50% 12.50% 15.82 1.10 14.78 0.98 -0.27* 15.05 0.96 -0.22*
50% 10% 5.00% 15.83 1.25 15.40 1.19 -0.12 15.46 1.06 -0.09
50% 25% 12.50% 15.65 1.16 14.70 1.03 -0.26* 14.94 1.00 -0.21*
50% 50% 25.00% 15.62 1.03 13.59 0.77 -0.52* 14.09 0.89 -0.41*

Note. The direction of the effect sizes (d) was calculated by subtracting the true mean from the observed mean (with random careless responses). * p < .05

FIGURE 1
The impact of random careless responding on aggregated scores by percentage of random
careless responses in the total sample, test difficulty, and relationship between ability and
random careless responding. Note that values greater than |0.20| were defined as of practi-
cal concern in this study (as represented by the horizontal dashed line). Furthermore, Easy
Unrelated = easy test difficulty, ability is unrelated to random careless responding; Easy
Related = easy test difficulty, ability is related to random careless responding; Moderate
Unrelated = moderate test difficulty, ability is unrelated to random careless responding; Mod-
erate Related = moderate test difficulty, ability is related to random careless responding; Hard
Unrelated = difficult test, ability is unrelated to random careless responding; Hard Related =
difficult test, ability is related to random careless responding.

not related to ability; however, the small differences were exacerbated for the
conditions in which the total amount of random careless responding in the sam-
ple was equal to 25%. For example, when random careless responding was unre-
lated to ability for this condition, the degree of underestimation was as much as
0.15 SDs greater than when it was related to ability. This was most likely due to
a greater random sampling of high ability simulees for the unrelated condition.

Comparability of Filtering Procedures in Purifying Biased Aggregated
Scores. As the previous section demonstrated that for easy tests as little as
6.25% of random careless responses in the total sample can lead to practically
significant underestimation of the true mean, it was important to compare the
effectiveness of two filtering procedures, examinee- and response-level filtering,
on purifying the biased aggregated scores. Results are first presented for condi-
tions in which random careless responding was unrelated to ability. Table 2 pro-
vides the standardized mean difference scores by filtering procedure. Across
conditions, examinee-level filtering was found to purify biased mean scores
nearly perfectly as d ranged from 0 to 0.01 (Figure 2). In contrast, response-level
filtering was found to slightly differ from the true mean, and this difference was
observed to increase as the tests became more difficult. That is, for an easy test

TABLE 2
Comparison of Filtering Approaches on Purifying Aggregated-Scores When Random Careless Responding is Unrelated to Ability

Columns: Test Difficulty | % Unmotivated Examinees | % Within-Examinee Random Careless Responses |
Total Amount of Random Careless Responses in Sample | “True” Mean | “True” SD |
Examinee-Level Filtering (Observed Mean, Observed SD, d) |
Response-Level Filtering (Observed Mean, Observed SD, d)

Easy 10% 10% 1.00% 21.82 1.06 21.82 1.07 0.00 21.81 1.04 0.00
(b D -1.0) 10% 25% 2.50% 21.86 1.17 21.86 1.18 0.00 21.86 1.16 0.00
10% 50% 5.00% 21.62 1.28 21.63 1.27 0.00 21.62 1.25 0.00
25% 10% 2.50% 21.87 1.17 21.87 1.18 0.00 21.90 1.16 0.01
25% 25% 6.25% 21.95 1.18 21.93 1.17 -0.01 21.95 1.16 0.00
25% 50% 12.50% 21.85 1.11 21.86 1.10 0.00 21.89 1.11 0.01
50% 10% 5.00% 21.75 1.10 21.79 1.09 0.01 21.80 1.11 0.01
50% 25% 12.50% 21.95 1.28 21.96 1.28 0.00 22.00 1.30 0.02
50% 50% 25.00% 21.76 1.08 21.71 1.11 -0.01 21.79 1.11 0.01
Moderate 10% 10% 1.00% 18.76 1.20 18.76 1.21 0.00 18.76 1.22 0.00
Difficulty 10% 25% 2.50% 18.90 1.10 18.92 1.10 0.00 18.90 1.12 0.00
(b D 0) 10% 50% 5.00% 18.83 1.10 18.83 1.10 0.00 18.82 1.13 0.00
25% 10% 2.50% 18.78 1.12 18.78 1.13 0.00 18.78 1.14 0.00
25% 25% 6.25% 18.55 1.19 18.55 1.20 0.00 18.54 1.23 0.00
25% 50% 12.50% 18.85 1.16 18.85 1.16 0.00 18.86 1.19 0.00
50% 10% 5.00% 18.74 1.11 18.74 1.11 0.00 18.76 1.15 0.00
50% 25% 12.50% 18.75 1.27 18.77 1.29 0.00 18.75 1.29 0.00
50% 50% 25.00% 18.72 1.20 18.72 1.24 0.00 18.71 1.24 0.00

Difficult 10% 10% 1.00% 15.71 1.12 15.71 1.13 0.00 15.65 1.16 -0.02
(b D 1.0) 10% 25% 2.50% 15.84 1.16 15.83 1.15 0.00 15.78 1.21 -0.02
10% 50% 5.00% 15.73 1.20 15.74 1.18 0.00 15.65 1.25 -0.02
25% 10% 2.50% 15.64 1.07 15.65 1.08 0.00 15.58 1.12 -0.02
25% 25% 6.25% 15.53 1.07 15.53 1.07 0.00 15.45 1.11 -0.02
25% 50% 12.50% 15.82 1.10 15.80 1.10 0.00 15.74 1.15 -0.03
50% 10% 5.00% 15.83 1.25 15.82 1.25 0.00 15.75 1.30 -0.03
50% 25% 12.50% 15.65 1.16 15.68 1.18 0.01 15.55 1.23 -0.03
50% 50% 25.00% 15.62 1.03 15.61 1.08 0.00 15.50 1.09 -0.04

Note. The direction of the effect sizes (d) was calculated by subtracting the “true” mean from the filtered mean. As a result, a positive effect size would indicate
that the filtering procedure overestimated the “true” mean or artificially inflated the mean after removing unmotivated examinees (examinee-level filtering) or
flagged responses (response-level filtering).
* p < .05


FIGURE 2
Bias on true mean scores when employing examinee-level filtering by percentage of
unmotivated simulees, test difficulty, and relationship between random careless respond-
ing and ability.

the difference ranged from 0.01 to 0.02 SDs, while ranging from 0.02 to 0.04 for
a difficult test; however, these filtered means were not statistically or practically
different from the true mean (Figure 3). As a result, when unmotivated simulees
were sampled from across the ability distribution, both examinee- and response-
level filtering were found to perform equally well.
Table 3 presents the comparative results between filtering procedures for
when unmotivated simulees were assumed to be low ability. As opposed to the
previous findings, examinee-level filtering was found to predominately have
higher rates of inflation when compared to response-level filtering (Figures 2
and 3). Specifically, as the percentage of unmotivated simulees increased, the

FIGURE 3
Bias on true mean scores when employing response-level filtering by percentage of random
careless responses in the total sample, test difficulty, and relationship between random care-
less responding and ability.
TABLE 3
Comparison of Filtering Approaches on Purifying Aggregated-Scores When Random Careless Responding is Related to Ability

Columns: Test Difficulty | % Unmotivated Examinees | % Within-Examinee Random Careless Responses |
Total Amount of Random Careless Responses in Sample | “True” Mean | “True” SD |
Examinee-Level Filtering (Observed Mean, Observed SD, d) |
Response-Level Filtering (Observed Mean, Observed SD, d)

Easy 10% 10% 1.00% 21.82 1.06 21.85 1.34 0.04 21.69 1.32 -0.00
(b D -1.0) 10% 25% 2.50% 21.86 1.17 22.03 1.13 0.04 21.87 1.11 0.00
10% 50% 5.00% 21.62 1.28 21.76 1.09 0.04 21.64 1.06 0.01
25% 10% 2.50% 21.87 1.17 22.30 1.14 0.13* 21.84 1.12 -0.00
25% 25% 6.25% 21.95 1.18 22.45 1.20 0.13* 22.00 1.16 0.01
25% 50% 12.50% 21.85 1.11 22.18 1.19 0.13* 21.78 1.19 0.03
50% 10% 5.00% 21.75 1.10 23.08 1.04 0.39* 21.71 1.05 0.01
50% 25% 12.50% 21.95 1.28 23.34 1.04 0.40* 21.97 1.07 0.02
50% 50% 25.00% 21.76 1.08 23.17 1.10 0.39* 21.98 1.13 0.07
Moderate 10% 10% 1.00% 18.76 1.20 18.70 1.14 0.04 18.53 1.17 -0.00
Difficulty 10% 25% 2.50% 18.90 1.10 19.07 1.11 0.04 18.90 1.14 0.00
(b D 0) 10% 50% 5.00% 18.83 1.10 18.92 1.26 0.04 18.79 1.28 -0.01
25% 10% 2.50% 18.78 1.12 19.09 1.25 0.13* 18.61 1.26 -0.00
25% 25% 6.25% 18.55 1.19 19.12 1.29 0.13* 18.64 1.29 0.01
25% 50% 12.50% 18.85 1.16 19.27 1.13 0.13* 18.86 1.13 0.03
50% 10% 5.00% 18.74 1.11 20.06 1.22 0.38* 18.56 1.22 -0.00
50% 25% 12.50% 18.75 1.27 20.44 1.19 0.39* 19.01 1.18 0.02
50% 50% 25.00% 18.72 1.20 20.40 1.14 0.38* 19.16 1.16 0.07

Difficult 10% 10% 1.00% 15.71 1.12 15.85 1.19 0.04 15.62 1.24 -0.02
(b D 1.0) 10% 25% 2.50% 15.84 1.16 15.66 1.19 0.04 15.46 1.22 -0.02
10% 50% 5.00% 15.73 1.20 16.07 1.14 0.04 15.88 1.17 0.01
25% 10% 2.50% 15.64 1.07 16.11 1.10 0.13* 15.56 1.10 -0.02
25% 25% 6.25% 15.53 1.07 16.25 1.14 0.13* 15.73 1.16 0.01
25% 50% 12.50% 15.82 1.10 16.41 1.10 0.12* 15.95 1.14 0.00
50% 10% 5.00% 15.83 1.25 17.22 1.20 0.37* 15.77 1.15 -0.01
50% 25% 12.50% 15.65 1.16 17.15 1.22 0.37* 15.72 1.19 0.01
50% 50% 25.00% 15.62 1.03 17.24 1.25 0.38* 15.95 1.19 0.05

Note. The direction of the effect sizes (d) was calculated by subtracting the “true” mean from the filtered mean. As a result, a positive effect size would
indicate that the filtering procedure overestimated the “true” mean or artificially inflated the mean after removing unmotivated examinees (examinee-level fil-
tering) or flagged responses (response-level filtering). * p < .05

inflation of the true mean increased. As an example, when the total amount of
unmotivated simulees was equal to 10%, the filtered mean was .04 SDs (p >
0.05) greater than the true mean; however, when the percentage of unmotivated
simulees increased to 25% and 50%, the true mean was inflated by as much as
0.13 (p < 0.05) and 0.40 (p < 0.05) SDs, respectively (Table 3). These findings
were robust to the percentage of within-simulee random careless responding and
test difficulty for two reasons: (1) simulees were filtered regardless of whether
they rapid guessed on 10% or 50% of the items, and (2) valid responses were
completely removed from unmotivated simulees and as a result, the difference
in simulee ability and test difficulty had no impact. In contrast, the degree of
score inflation for response-level filtering was relatively consistent as when
unmotivated simulees were randomly sampled from across the ability distribu-
tion. That is, at most, when applying this type of filtering, the aggregated score
was overestimated by 0.07 SDs when random careless responses comprised 25%
of the total response sample (Figure 3); however, across conditions, there was
no significant bias on the aggregated scores as was seen with examinee-level fil-
tering. As a result, response-level filtering was found to be robust to the condi-
tion of random careless responding being related to ability.
Although this study demonstrated that response-level filtering provided fil-
tered means that were not significantly biased under conditions of motivation
being un/related to ability, it is limited in two ways. First, it makes the assump-
tion that any response not flagged can be used for the accurate estimation of
examinee ability. However, such an assumption is difficult to evaluate, particu-
larly when only response times are available (i.e., other evidence sources of cog-
nition, such as measures of eye-tracking, electroencephalogram [EEG], and
emotional state, are needed). Second, there are testing programs that may rely
solely on the use of classical test theory (CTT) for scoring and as result, cannot
employ response-level filtering as it requires the use of IRT. As a result, many
practitioners may see the use of examinee-level filtering as the only option for
purifying biased aggregated scores. Consequently, it is vital to assess whether
the assumption underlying this filtering approach is tenable in practice.

STUDY 2

To evaluate the assumption that random careless responding is unrelated to ability,
most researchers have computed discriminant validity coefficients between the total
proportion of nonrandom careless responses (response time effort; RTE) and meas-
ures of ability (e.g., GPA, SAT). These analyses have shown strong discriminant
validity as the coefficients obtained have been small and nonsignificant (r = −0.04
to r = 0.15; Kong et al., 2007; Rios et al., 2014; Sundre & Wise, 2003; Wise &
DeMars, 2005; Wise & Kong, 2005; Wise, Wise, & Bhola, 2006). However, these
results may largely be due to the statistical artifact of possessing highly negatively
skewed RTE distributions, which artificially attenuate the true correlation coeffi-
cients. As a result, it is argued that the validity of examinee-level filtering is best
examined when comparing performance differences between examinees deemed to
be motivated and unmotivated. To this end, a few studies have examined mean
score differences on prior achievement measures between motivation groups (e.g.,
Allen et al., 2016; Rios et al., 2014; Wise & Kong, 2005). As an example, Wise
and Kong found that examinees who randomly careless responded on 20% or more
of items actually scored slightly lower than their engaged counterparts on the SAT
(d = −0.33). Similarly, Allen and colleagues (2016) found that disengaged examin-
ees on a college writing task possessed significantly lower self-reported ACT scores
when compared to engaged examinees (d = −0.52). The results from these two
studies may suggest a relationship between test-taking motivation and ability, which
calls for further examination of the issue, particularly as this relationship impacts the
decision of which filtering procedure to employ.
As a result of the limited research in this area and the fact that we showed
examinee-level filtering to lead to large bias when unmotivated examinees are
predominately of low ability, the objective of this study was to evaluate perfor-
mance differences between motivated and unmotivated examinees on expected
performance for a large-scale low-stakes assessment and prior achievement mea-
sured by the SAT. Such an analysis should provide some evidence regarding the
tenability of the assumption underlying examinee-level filtering in practice.

Method

Data. A computer-based version of the ETS Proficiency Profile (EPP), a
low-stakes 108-item college-level test assessing critical thinking, reading, writing,
and mathematics was administered to freshmen students from a Midwestern uni-
versity. Test administration consisted of two one-hour sessions with 54 items
being administered in each testing session. In total, only 72 of the 108 items were
used in the analysis as it was impossible to identify random careless responding
for all items across the three response latency procedures described in the next
section (more details are provided below). In total, 1448 examinees were admin-
istered the EPP; however, due to the computation of aggregated scores, only com-
plete data (i.e., no missing responses) were used for this study. Complete data
were available for 1322 examinees, which provided response times for 95,184
item responses (1322 examinees x 72 items). In addition, SAT (combined reading
and quantitative) scores were available for 1319 of the examinees.

Analysis. To assess the tenability of the assumption that random careless
responding is unrelated to ability, two comparative analyses were conducted. The
first analysis compared differences in filtered mean scores between examinee-
and response-level filtering. Although the true mean was not known as real data
were evaluated, if the examinee-level filtered mean was higher than that of
response-level filtering, the results from study 1 would suggest that such a trend
may largely be artificial inflation of the mean due to the removal of low ability
examinees. To support the results from the first analysis, motivation groups were
compared in terms of expected group means on the EPP and observed means on
the SAT. However, before conducting either analysis, random careless responses
and, in turn, unmotivated examinees first had to be classified. For the purposes of
study 2, response latency was used as an indicator of random careless responses
because, unlike in study 1, the true random careless responses were not known.
Three response time threshold procedures were implemented to classify random
careless responding: the three-second (3SEC), normative threshold (three levels:
15% [NT15], 20% [NT20], 25% [NT25]), and visual inspection (VI) procedures.
Each one of these procedures functions by defining a random careless response as
any response given in less time than that set by the response time threshold, which
differs across procedures. Specifically, a random careless response is defined as any
response given in less than (a) three seconds for the 3SEC procedure and (b) a per-
centage (15%, 20%, or 25%) of the mean item response time for the NT procedure.
For example, if the NT20 procedure (using 20% of the mean item response time as
the threshold) was employed and the mean item response time was 25 seconds, a
random careless response would be defined as a response provided in less than five
seconds (threshold: 25 seconds × 0.20 = 5 seconds). In contrast, the VI proce-
dure defines a response time threshold as the time in which the nonsolution (i.e.,
examinees tend to respond very quickly without processing the item content) and
solution behavior (i.e., examinees try to seek the correct response; Kong et al.,
2007) response time distributions meet. For example in Figure 4, a random

FIGURE 4
An example of a response time frequency distribution with rapid guessing.

FIGURE 5
An example of a response time frequency distribution with no clear pattern of rapid guessing.

careless response would be defined as a response given in less than seven seconds.
However, a clear delineation between the solution and nonsolution behavior distribu-
tions is not always possible (see Figure 5 for an example). When this is the case, ran-
dom careless responding cannot be clearly defined and identified, which was the
case for 36 of the 108 items in the present dataset. As a result, these items were
dropped from the analysis for the purpose of comparing the results across the
3SEC, NT, and VI threshold procedures.
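The 3SEC and NT rules reduce to simple response-time thresholds, as in the base R sketch below; the VI rule requires inspecting each item's response time distribution and is therefore not shown. The matrix rt (examinees by items, in seconds) is hypothetical, and the sketch is illustrative rather than the authors' code.

```r
# Illustrative sketch of response time threshold flagging (3SEC and NT rules).
flag_3sec <- rt < 3                                       # 3SEC: any response under three seconds

nt_thresholds <- function(rt, pct = 0.20) {
  colMeans(rt) * pct                                      # NT: a percentage of each item's mean response time
}
flag_nt20 <- sweep(rt, 2, nt_thresholds(rt, 0.20), "<")   # TRUE = flagged as a random careless response
```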
Once random careless responses were flagged, examinees were classified
as unmotivated for each flagging procedure by using response time effort
(RTE; Wise & Kong, 2005), which was calculated as:
RTE_j = 1 - \left(\sum_{i=1}^{I} SB_{ij}\right) \Big/\, I,     (4)

where RTE_j is equal to response time effort for examinee j, I is equal to the total
number of items, and SB_{ij} is equal to the solution behavior indicator for examinee j
on item i. For SB_{ij}, a value of 1 was given to examinee j on item i for any item
identified as a random careless response, while a value of 0 was given if examinee j
spent more time answering item i than the threshold set by the respective flagging
method.
from examinees are arbitrarily removed if their RTE is less than 0.90; however, to
assess the adequacy of various RTE thresholds for examinee-level filtering, we set
thresholds from 0.50 to 0.90 in increments of 0.10, but, for brevity’s sake, we only
present results for an RTE threshold of 0.20 following the recommendations of Hauser
and Kingsbury (2009).2

2. If interested, the reader can contact the first author for full results.
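Equation (4) and the subsequent classification step can be sketched as follows; flags is a hypothetical examinees-by-items logical matrix of flagged responses (SB_ij = 1), and the 0.80 cut-off corresponds to the 20%-flagged rule used in this study.

```r
# Illustrative sketch of response time effort (Equation 4) and examinee classification.
rte <- 1 - rowSums(flags) / ncol(flags)   # RTE_j = 1 - (sum over items of SB_ij) / I

# Classify an examinee as unmotivated when 20% or more of his or her responses are flagged
unmotivated <- rte <= 0.80
table(unmotivated)
```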

Once responses were flagged and examinees were classified as motivated or
unmotivated, both examinee- and response-level filtering were applied to compute
filtered means as described in study 1. In terms of comparing performance
between motivation groups, an IRT expected mean for the EPP (ranging from 0 to
72 points) was calculated separately by group using the process described in study
1. Expected group means were then compared for all examinees classified as moti-
vated or unmotivated using t-tests and effect sizes (Cohen’s d). As the assumption
underlying the use of expected scores (i.e., any response not flagged is a valid
indicator of ability) is difficult to evaluate without other cognitive data sources,
the results were cross-validated by comparing combined SAT verbal and math
scores across groups. The use of SAT scores was justified as a cross-validation
measure for three reasons: (1) it is a high-stakes assessment that should not be
impacted by low test-taking motivation, (2) the constructs measured by the EPP
and SAT are closely related (Liu, 2008), and (3) the time between administration
of the EPP and SAT was less than one year. For this analysis, if random careless
responding was unrelated to ability, one would expect to see no significantly large
differences in expected test performance or prior achievement between groups.

Results

Descriptives. Table 4 presents the average number of random careless
responses per examinee as well as the percent of total careless responses
represented in the sample by flagging procedure. As shown, the average
number of flagged responses by examinee differed by as much as 3.35

TABLE 4
Results of Rapid Response Flagging and Examinee-Level Filtering to EPP Data

Columns: Method | Rapid Response Flagging (Mean, SD, Proportion Correct, % of Total Responses) |
Examinee-Level Filtering (N, Mean, SD, d) | Response-Level Filtering (N, Mean, SD, d)

3SEC 4.61 9.75 .22 6.41 1202 37.48 10.81 .12* 1322 36.26 9.93 .00
NT15 5.64 10.85 .21 7.84 1162 37.80 10.80 .15* 1322 36.31 9.89 .01
NT20 6.81 11.74 .21 9.47 1118 38.24 10.68 .19* 1322 36.37 9.85 .01
NT25 7.82 12.39 .22 10.87 1081 38.57 10.66 .22* 1322 36.41 9.83 .02
VI 7.96 12.04 .22 11.05 1078 38.59 10.66 .22* 1322 36.43 9.82 .02

Note. The expected proportion correct for flagged responses was chance (.25). The percent of total
responses column provides the percentage of careless responses in the total sample based on the
respective flagging procedure. Unfiltered group mean (SD) D 36.17 (11.18). d was based on the dif-
ference between the unfiltered and filtered means. As a result, a positive d indicates that after filtering
the mean was higher than when unfiltered. * p < .05

responses when comparing the most liberal (3SEC) and strict (VI) rules.
Across procedures, the proportion correct for the flagged responses was
close to that expected by chance, which provided some evidence that the
careless responding identified may have been random. However, based on
the findings from study 1, the actual impact of careless responding on
aggregated scores in this sample may not have been significant as (a) this
assessment was of moderate difficulty (the filtered mean item difficulty
ranged from 0.27 to 0.40 [differed by flagging procedure] based on a modi-
fied 3PL model), and (b) the percent of total careless responses in the
sample ranged from as little as 6.41% (3SEC) to 11.05% (VI).

Evaluation of the Underlying Examinee-Level Filtering Assumption.


Once responses were flagged, examinees were classified as unmotivated if they
were found to possess flagged responses on 20% or more of the items. This led to
relatively large differences in classifications between flagging procedures. For
example, the 3SEC rule classified 120 (9.10%) examinees as unmotivated, while
more than twice as many examinees were classified using the VI method (244;
18.46%; Table 4).Upon removing unmotivated examinees from the total sample
and computing filtered means by threshold procedure, the mean was found to
increase by 0.12 to 0.22 SDs. By itself, such a result would suggest that random
careless responding had a negligible to small biasing effect on the sample aggre-
gated score; however, when inspecting the filtered means based on response-level
filtering, the difference from the unfiltered mean ranged from 0 to 0.02 SDs
(Table 4). From study 1, we know that the only condition in which the employment
of examinee-level filtering led to larger filtered means when compared to response-
level filtering was when the assumption underlying examinee-level filtering was
untenable. To validate this finding, the next step was to evaluate performance dif-
ferences between motivation groups on the EPP (expected performance) and SAT.
As is shown in Table 5, motivated examinees were expected to score on aver-
age between 37.42 and 38.53 points on the EPP depending on the threshold

TABLE 5
Differences in Expected EPP and SAT Performance by Motivation Group

Columns: Method | Expected EPP Performance: Motivated (N, Mean, SD), Unmotivated (N, Mean, SD), d |
SAT Performance: Motivated (N, Mean, SD), Unmotivated (N, Mean, SD), d

3SEC 1202 37.42 9.59 120 25.49 5.84 1.50* 1047 966.49 143.50 97 921.44 117.10 .34*
NT15 1162 37.75 9.54 160 26.21 5.81 1.46* 1012 967.89 144.25 127 917.08 115.05 .38*
NT20 1118 38.18 9.42 204 26.71 5.93 1.45* 975 971.67 143.98 159 915.03 115.17 .43*
NT25 1081 38.50 9.38 241 27.11 5.84 1.45* 939 974.74 144.65 193 907.09 115.47 .51*
VI 1078 38.53 9.36 244 27.17 5.94 1.44* 939 974.31 144.61 193 908.08 114.10 .50*

Note. d is the effect size when comparing the difference between engaged and disengaged
expected group means. Results are based on a RTE threshold of .20. * p < .05

procedure. In comparison, unmotivated examinees were found to possess expected
group means ranging from 25.49 to 27.17 points on a 72-point scale. This resulted
in standardized differences (d) ranging from 1.44 to 1.50 SDs across flagging pro-
cedures, which indicated that had examinees not randomly careless responded on
any items, the examinees classified as engaged would be expected to score well
over one standard deviation higher than their disengaged counterparts.
However, this finding may have been biased, as examinees classified as disengaged could have put forth little effort across the entire test, even on responses that were not flagged as random careless responses. Therefore, to account for
this possible confounding, we cross-validated our results by evaluating the dif-
ference in prior achievement (combined SAT verbal and quantitative scores)
across groups. Our results demonstrated that the motivated examinees scored
between 966.49 and 974.31 points on a 1,600-point scale. In comparison, unmoti-
vated examinees scored between 908.08 and 921.44 points, which led to stan-
dardized differences (d) ranging between 0.34 and 0.51 SDs across flagging
procedures (Table 5). Although the differences were smaller when compared to
expected performance on the EPP, such a finding suggests that the unmotivated
examinees were of lower ability than their engaged counterparts, which puts into
question the tenability of the assumption underlying examinee-level filtering.
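The standardized differences (d) reported above and in Table 5 can be reproduced with an unweighted pooling of the two group standard deviations. The pooling choice in the sketch below is inferred from the fact that it matches the reported values; it is not stated explicitly by the authors.

```python
import math

def cohens_d(mean1, sd1, mean2, sd2):
    """Standardized mean difference using the unweighted root mean square of
    the two group SDs (assumed here because it reproduces Table 5)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2.0)
    return (mean1 - mean2) / pooled_sd

# Values from Table 5 for the 3SEC rule:
print(round(cohens_d(37.42, 9.59, 25.49, 5.84), 2))        # expected EPP: ~1.50
print(round(cohens_d(966.49, 143.50, 921.44, 117.10), 2))  # SAT: ~0.34
```

A sample-size-weighted pooled SD would yield somewhat smaller values given the unbalanced group sizes, but the unweighted version reproduces the tabled effect sizes.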

DISCUSSION

As the use of group-level low-stakes testing prevails, there is an increasing need to better understand whether random careless responding has a signifi-
cant impact on inferences made from aggregated scores and if so, how to
best deal with such behavior. To this end, in this study we demonstrated, as
hypothesized, that the impact of random careless responding on aggregated
scores is dependent on the interaction between test difficulty and the per-
centage of random careless responses in the sample. That is, as the test
became easier, the percentage of random careless responses needed to sig-
nificantly underestimate (d > 0.20) the mean score decreased. This finding
can be explained by the fact that the degree of bias in aggregated scores is largely driven by the difference between the expected probability of a correct response and the probability of a random guess. As this difference grows, so does the bias in the aggregated scores, which is why we observed the largest biasing effects on an easy test.
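The mechanism described here can be made concrete with a back-of-the-envelope calculation: each random careless response trades the model-implied probability of a correct answer for the chance rate, so the expected downward bias in the mean grows with both the careless-response rate and the gap between those two probabilities. The sketch below uses a simplified 2PL-style approximation of our own; the parameter values are purely illustrative.

```python
import numpy as np

def expected_mean_bias(difficulties, careless_rate, theta=0.0, a=1.0, n_options=4):
    """Approximate expected downward bias (in proportion-correct units) when a
    given fraction of responses are replaced by random guesses."""
    b = np.asarray(difficulties, dtype=float)
    p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # probability under effort
    chance = 1.0 / n_options                              # probability under guessing
    return careless_rate * np.mean(p_correct - chance)

# An easy test (b = -1) versus a difficult test (b = +1) at a 10% careless rate:
print(expected_mean_bias(np.full(30, -1.0), 0.10))   # larger bias on the easy test
print(expected_mean_bias(np.full(30, 1.0), 0.10))    # smaller bias on the hard test
```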
In addition to evaluating the impact of random careless responding on
aggregated scores, a comparative analysis of filtering approaches was con-
ducted. This was done by proposing a simple IRT method (response-level
filtering) and comparing it to the most popular method in the literature
(examinee-level filtering). As hypothesized, a simulated investigation of
filtering accuracy demonstrated two important findings: (1) when ability is unrelated to random careless responding (assumed by examinee-level filter-
ing), examinee-level filtering provided accurate scores; however, when this
assumption was untenable, significant overestimation of aggregated scores
(d > 0.20) occurred under certain conditions, and (2) response-level filtering accurately recovered the aggregated scores across all simulated conditions investigated. These results led us to examine whether the
assumption underlying examinee-level filtering is tenable in practice. Our
analysis of applied data illustrated that examinees with high rates of random
careless responding on average possessed lower expected performance on
the assessment of interest as well as lower ability on a high-stakes measure
of prior achievement (SAT). Consequently, if this result holds true for other
samples, the utility of employing examinee-level filtering for operational
use is questionable.
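To make the contrast between the two approaches concrete, the sketch below implements response-level filtering as described in this article's terms: flagged responses are replaced by their model-implied expected item scores under a constrained 3PL with the guessing parameter fixed at 0.25, while unflagged responses are left untouched. The variable names, matrix layout, and exact replacement step are our assumptions for illustration, not the authors' code.

```python
import numpy as np

def constrained_3pl(theta, a, b, c=0.25):
    """Constrained 3PL item response function with a fixed guessing parameter."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def response_level_filter(scored, flags, theta, a, b, c=0.25):
    """Replace flagged (careless) responses with model-implied expected item
    scores and return the resulting mean total score.

    scored : 2D array (examinees x items) of 0/1 responses
    flags  : 2D boolean array; True = careless
    theta  : 1D array of ability estimates (one per examinee)
    a, b   : 1D arrays of item discriminations and difficulties
    """
    expected = constrained_3pl(theta[:, None], a[None, :], b[None, :], c)
    filtered = np.where(flags, expected, scored.astype(float))
    return filtered.sum(axis=1).mean()
```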

Limitations
In general, the findings from this study are limited to random careless
responding in relation to multiple-choice assessments and cannot be general-
ized to other forms of responding that do not accurately reflect an examinee’s
skill or ability (e.g., content responsive faking, inattentiveness, linguistic
incompetence, and LongString responding [providing the same response con-
secutively across a number of items]) and/or surveys and questionnaires;
however, there has been extensive previous work conducted in these areas
(e.g., Curran, 2015; Huang, Liu, & Bowling, 2015; Meade & Craig, 2011). In
addition to this general limitation, there are study-specific limitations that
should be noted. For study 1, one of the major limitations concerned how
response-level filtering was employed. Specifically, data were generated with
a 3PL model, but were estimated using a constrained 3PL model in which the
c-parameter was set equal to 0.25 across all items. As a result, the estimated
item parameters (including a and b) were biased due to the different models
used for data generation and estimation, which may have had an impact on
the recovery of the true mean by the response-level filtering procedure. Con-
sequently, this filtering procedure may have been more accurate than demon-
strated. Nevertheless, such item parameter estimation bias is to be expected
when administering multiple-choice items as the c-parameter may be difficult
to estimate (Han, 2012).
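For reference, the item response function at issue is the standard three-parameter logistic model (the notation is ours, consistent with the description above):

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-a_i(\theta - b_i)\right]},$$

where the generating model allowed $a_i$, $b_i$, and $c_i$ to vary freely across items, whereas the estimation model constrained $c_i = 0.25$ for all items.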
For study 2, one limitation was that results were based on data from one institution that administered the EPP, and therefore the findings may not general-
ize to examinees in other institutions or other types of tests. An additional
limitation is that the expected means were computed based on the overarching assumption that nonflagged responses were not themselves careless
responses. Although we showed that a performance difference was maintained when evaluating SAT scores between motivation groups, it is possible
that some examinees may have still employed nonsolution behavior that went
undetected by the flagging procedures. Further, we computed expected scores
for examinees regardless of the number of total careless responses. Although
this was expected to have little to no impact on the aggregated scores, further
research is needed to determine the maximum number of within-examinee careless responses that still allows for acceptable estimation of expected scores.

Future Research
There are three important areas of future research. The first is related to the
development and/or validation of procedures to identify careless responding.
As noted in the previous sections, prior work has examined identification methods in both survey and cognitive assessments; however,
when it comes to methods that evaluate response latencies as a proxy of
careless responding, there is a need to use additional cross-validation meth-
ods to ensure that the criteria used for classification are accurate. To this
end, it is suggested that a multimodal approach to validating response
latency approaches be taken by evaluating examinee behavior, such as eye
movement and key-stroke behavior. The latter approach may be particularly
important for identifying careless responding on constructed response items
as there has been limited research on this topic (e.g., Steedle, 2014).
A second important area of future research is related to the development
of new filtering approaches. This study has illustrated that the biasing effect
of careless responding on aggregated scores is largely dependent on the
overall percentage of careless responses in the sample. Yet, the majority of
literature has focused on the degree of within-examinee careless responding,
which is reflected in the development and popularity of the examinee-level
filtering procedure. Although this study has shown the effectiveness of
employing IRT to compute expected scores, what happens if a testing pro-
gram does not employ IRT (i.e., cannot use response-level filtering)? Fur-
thermore, what can be done if the assumption underlying examinee-level
filtering is found to be untenable? In such a case, the only option may be to
allow the random careless responses to remain in the total sample. Conse-
quently, there is a need to develop new filtering procedures that do not rely
on IRT and can limit the bias seen in examinee-level filtering when abil-
ity is related to random careless responding.
Lastly, greater focus should be placed on approaches to handling careless
responding a priori. There is a rich literature related to the design and
implementation of interventions geared toward improving examinee test-tak-
ing motivation, which spans back to the 1940s (Rothe, 1947). These
interventions have attempted to improve test-taking motivation by manipulating (a) the perception of testing consequences (e.g., Liu, Rios, & Borden,
2015), (b) proctoring procedures (e.g., Ward & Pond, 2015), (c) warnings
for off-task behaviors (e.g., Wise, Bhola, & Yang, 2006), (d) incentives
(e.g., Braun, Kirsch, & Yamamoto, 2011), (e) delivery of feedback during
the testing session (e.g., Ling, Attali, Finn, & Stone, 2016), and (f) testing
mode (e.g., Ortner, Weißkopf, & Koch, 2014). Although research has been
extensive, there has been no synthesis to evaluate the effectiveness of such
interventions to this point. As a result, there is a need for a meta-analytic
study that investigates the overall impact of motivation interventions on
improving test-taking motivation and test performance on both noncognitive
and cognitive measures in low-stakes testing contexts. Findings from such a
study may provide best-practice guidelines for practitioners dealing with the
threat of test-taking motivation before it can bias test scores.

Implications
Overall, the results from our studies provide a number of implications for practi-
tioners. First, although the removal of any random careless response is desirable,
it must first be considered whether the amount of random careless responding introduces practically significant bias into aggregated scores and thus warrants filtering. This evaluation should not be based on the number of unmotivated examin-
ees (i.e., examinees with amounts of careless responses deemed to be too large),
but rather on the combination of test difficulty and the total percentage of careless
responses in the sample. Second, if the percent of careless responding is found to
be of significant concern, it is recommended that response-level filtering should
be applied to improve data quality due to its less restrictive underlying assump-
tions and superior filtering accuracy. If one is unable to employ response-level
filtering due to a lack of knowledge concerning IRT or testing program con-
straints (e.g., a program only uses CTT), examinee-level filtering should be
employed; however, before doing so, the underlying assumption that ability is
unrelated to random careless responding should be evaluated. This recommenda-
tion relates to the fact that when this assumption is untenable, examinee-level fil-
tering can create more bias than leaving the data unfiltered. If this assumption
cannot be tested due to a lack of prior achievement data or the assumption is
found to be untenable, one solution could be to iteratively remove examinees
with the highest percentages of within-examinee careless responding until the
overall percent of careless responses in the sample is within an acceptable range
(refer to the findings from study 1). Such an approach may allow for filtering
careless responses while mitigating artificial inflation from removing low ability
examinees. These recommendations, when applied correctly, should assist prac-
titioners with managing the threat of low test-taking motivation (from a post-hoc
standpoint) to improve the validity of inferences made from low-stakes group-based assessments.
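A minimal sketch of the iterative removal heuristic suggested above follows. The target rate, data layout, and function name are illustrative assumptions; in practice, an acceptable overall rate would depend on test difficulty, per the findings of study 1.

```python
import numpy as np

def iterative_examinee_removal(scored, flags, target_rate=0.0625):
    """Drop, one at a time, the examinee with the highest within-person rate of
    flagged responses until the overall careless-response rate in the retained
    sample is at or below `target_rate`; return the retention mask and the
    filtered mean total score."""
    keep = np.ones(scored.shape[0], dtype=bool)
    while flags[keep].mean() > target_rate and keep.sum() > 1:
        rates = np.where(keep, flags.mean(axis=1), -1.0)   # ignore already-dropped rows
        keep[np.argmax(rates)] = False                     # remove the worst offender
    return keep, scored[keep].sum(axis=1).mean()
```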

ACKNOWLEDGMENTS
The authors would like to thank Brent Bridgeman, Guangming Ling, and James
Carlson from the Educational Testing Service for their comments on an earlier draft.

REFERENCES

Allen, L. K., Mills, C., Jacovina, M. E., Crossley, S., D’Mello, S., & McNamara, D. S. (2016, April).
Investigating boredom and engagement during writing using multiple sources of information: The
essay, the writer, and keystrokes. Paper presented at the 6th International Learning Analytics and
Knowledge Conference, Edinburgh, Scotland.
American Educational Research Association. (2000). Position statement of the American Educational
Research Association concerning high-stakes testing in PreK–12 education. Educational
Researcher, 29, 24–25.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (2014). Standards for educational and psychological test-
ing (6th ed.). Washington, DC: American Educational Research Association.
Baumert, J., & Demmrich, A. (2001). Test motivation in the assessment of student skills: The effects of
incentives on motivation and performance. European Journal of Psychology of Education, 16, 441–462.
Braun, H., Kirsch, I., & Yamamoto, K. (2011). An experimental study of the effects of monetary
incentives on performance on the 12th-grade NAEP reading assessment. Teachers College Record,
11, 2309–2344.
Buchholz, J., & Rios, J. A. (2014, July). Examining response time threshold procedures for the identi-
fication of rapid-guessing behavior in small samples. Paper presented at the 9th Conference of the
International Test Commission, San Sebastian, Spain.
Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors.
Psychometrika, 73, 209–230.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Covington, M. V., & Beery, R. (1976). Self-worth and school learning. New York: Holt, Rinehart, &
Winston.
Curran, P. G. (2015). Methods for the detection of carelessly invalid responses in survey data. Jour-
nal of Experimental Social Psychology. Advance online publication. http://dx.doi.org/10.1016/j.
jesp.2015.07.006
De Ayala, R. J., Plake, B., & Impara, J. C. (2001). The effect of omitted responses on the accuracy of
ability estimation in item response theory. Journal of Educational Measurement, 38, 213–234.
Debeer, D., Buchholz, J., Hartig, J., & Janssen, R. (2014). Student, school, and country differences in
sustained test-taking effort in the 2009 PISA reading assessment. Journal of Educational and
Behavioral Statistics, 39, 502–523.
DeMars, C. E., & Wise, S. L. (2010). Can differential rapid-guessing behavior lead to differential
item functioning? International Journal of Testing, 10, 207–229.
Goegebeur, Y., De Boeck, P., Wollack, J. A., & Cohen, A. S. (2008). A speeded item response model
with gradual process change. Psychometrika, 73, 65–87.
Goldhammer, F., Martens, T., Christoph, G., & Lüdtke, O. (2016). Test-taking engagement in PIAAC
(OECD Education Working Papers, No. 133). Paris, France: OECD Publishing.
Guo, H., Rios, J. A., Haberman, S., Liu, O. L., Wang, J., & Paek, I. (2016). A new procedure for
detection of students’ rapid guessing responses using response time. Applied Measurement in Edu-
cation. Advance online publication. doi:10.1080/08957347.2016.1171766
Han, K. T. (2012). Fixing the c parameter in the three-parameter logistic model. Practical Assess-
ment, Research & Evaluation, 17, 1–24.
Hauser, C., & Kingsbury, G. G. (2009, April). Individual score validity in a modest-stakes adaptive
educational testing setting. Paper presented at the meeting of the National Council on Measure-
ment in Education, San Diego, CA.
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2011). Detecting and deterring
insufficient effort responding to surveys. Journal of Business and Psychology, 27, 99–114.
Huang, J. L., Liu, M., & Bowling, N. A. (2015). Insufficient effort responding: Examining an insidi-
ous confound in survey data. Journal of Applied Psychology, 3, 828–845.
Jagacinski, C. M., & Nicholls, J. G. (1990). Reducing effort to protect perceived ability: “They’d do
it, but I wouldn’t.” Journal of Educational Psychology, 82, 15–21.
Jin, K., & Wang, W. (2014). Item response theory models for performance decline during testing.
Journal of Educational Measurement, 51, 178–200.
Kong, X. J. (2007). Using response time and the effort-moderated model to investigate the effects of
rapid guessing on estimation of item and person parameters (Unpublished doctoral dissertation).
Harrisonburg, VA: James Madison University.
Kong, X. J., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to dif-
ferentiate solution behavior from rapid-guessing behavior. Educational and Psychological Mea-
surement, 67, 606–619.
Lee, Y.-H., & Jia, Y. (2014). Using response time to investigate students’ test-taking behaviors in a
NAEP computer-based study. Large-scale Assessments in Education, 2, 1–24.
Ling, G., Attali, Y., Finn, B., & Stone, E. (2016). Is a computerized adaptive test more motivating
than a fixed item test? Manuscript submitted for publication.
Liu, O. L. (2008). Measuring learning outcomes in higher education using the Measure of Academic
Proficiency and Progress (MAPP) (ETS RR-08-47). Princeton, NJ: Educational Testing Service.
Liu, O. L., Rios, J. A., & Borden, V. (2015). Does motivational instruction affect college students’ per-
formance on low-stakes assessment? An experimental study. Educational Assessment, 20, 79–94.
Meade, A. W., & Craig, S. B. (2011, April). Identifying random careless responses in survey data.
Paper presented at the 26th annual meeting of the Society for Industrial and Organizational Psy-
chology, Chicago, IL.
Ortner, T. M., Weißkopf, E., & Koch, T. (2014). I will probably fail: Higher ability students' motiva-
tional experiences during adaptive achievement testing. European Journal of Psychological
Assessment, 30, 48–56.
Osborne, J. W., & Blanchard, M. R. (2011). Random responding from participants is a threat to the
validity of social science research results. Frontiers in Psychology, 1, 1–7.
Penk, C., & Schipolowski, S. (2015). Is it all about value? Bringing back the expectancy component
to the assessment of test-taking motivation. Learning and Individual Differences, 42, 27–35.
Rios, J. A., Liu, O. L., & Bridgeman, B. (2014). Identifying unmotivated examinees on student learn-
ing outcomes assessment: A comparison of two approaches. New Directions for Institutional
Research, 161, 69–82.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory
analyses. Journal of Statistical Software, 17, 1–25.
Rothe, H. F. (1947). Distribution of test scores in industrial employees and applicants. Journal of
Applied Psychology, 31, 480–483.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model:
A new method of measuring speededness. Journal of Educational Measurement, 34, 213–232.
Steedle, J. T. (2014). Motivation filtering on a multi-institution assessment of general college outcomes. Applied Measurement in Education, 27, 58–76.
Sundre, D. L., & Wise, S. L. (2003, April). Motivation filtering: An exploration of the impact of low
examinee motivation on the psychometric quality of tests. Paper presented at the meeting of the
National Council on Measurement in Education, Chicago, IL.
Thompson, T., Davidson, J. A., & Garber, J. G. (1995). Self-worth protection in achievement motivation:
Performance effects and attributional behavior. Journal of Educational Psychology, 87, 598–610.
Ward, M. K., & Pond, S. B. (2015). Using virtual presence and survey instructions to minimize care-
less responding on internet-based surveys. Computers in Human Behavior, 48, 554–568.
Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes, com-
puter-based test. Applied Measurement in Education, 19, 95–114.
Wise, S. L. (2009). Strategies for managing the problem of unmotivated examinees in low-stakes test-
ing programs. The Journal of General Education, 58, 152–166.
Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied
Measurement in Education, 28, 237–252.
Wise, S. L., Bhola, D. S., & Yang, S. (2006). Taking the time to improve the validity of low-stakes tests:
The effort-monitoring CBT. Educational Measurement: Issues and Practice, 2, 22–30.
Wise, S. L., & DeMars, C. E. (2005). Examinee motivation in low-stakes assessment: Problems and
potential solutions. Educational Assessment, 10, 1–18.
Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated
IRT model. Journal of Educational Measurement, 43, 19–38.
Wise, S. L., & DeMars, C. E. (2009). A clarification of the effects of rapid guessing on coefficient
alpha: A note on Attali’s reliability of speeded number-right multiple-choice tests. Applied Psycho-
logical Measurement, 33, 488–490.
Wise, S. L., & DeMars, C. E. (2010). Examinee non-effort and the validity of program assessment
results. Educational Assessment, 15, 27–41.
Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an
adaptive achievement test. Journal of Educational Measurement, 53, 86–105.
Wise, S. L., Kingsbury, G. G., Thomason, J., & Kong, X. (2004, April). An investigation of motiva-
tion filtering in a statewide achievement testing program. Paper presented at the meeting of the
National Council on Measurement in Education, San Diego, CA.
Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in
computer-based tests. Applied Measurement in Education, 18, 163–183.
Wise, S. L., & Ma, L. (2012, April). Setting response time thresholds for a CAT item pool: The nor-
mative threshold method. Paper presented at the meeting of the National Council on Measurement
in Education, Vancouver, Canada.
Wise, S. L., Ma, L., Cronin, J., & Theaker, R. A. (2013, April). Student test-taking effort and the
assessment of student growth in evaluating teacher effectiveness. Paper presented at the meeting
of the National Council on Measurement in Education, San Francisco, CA.
Wise, V. L., Wise, S. L., & Bhola, D. S. (2006). The generalizability of motivation filtering in
improving test score validity. Educational Assessment, 11, 65–83.
