
Educational Measurement: Issues and Practice
Fall 2016, Vol. 35, No. 3, pp. 23–25

Do 45% of College Students Lack Critical Thinking Skills? Revisiting a Central Conclusion of Academically Adrift

David Lane and Frederick L. Oswald, Rice University

The educational literature, the popular press, and educated laypeople have all echoed a conclusion
from the book Academically Adrift by Richard Arum and Josipa Roksa (which has now become
received wisdom), namely, that 45% of college students showed no significant gains in critical
thinking skills. Similar results were reported by Pascarella, Blaich, Martin, and Hanson after the
publication of Arum and Roksa’s book in 2011. However, these authors’ statistical tests were
conducted incorrectly, and therefore this 45% finding is fundamentally untrue. We demonstrate that
a correct statistical analysis would have found that far fewer students show significant gains in
critical thinking. However, this does not reflect on student learning; instead, it reflects on how hard
it is to find a statistically significant result when assessing student change on a student-by-student
basis. This article discusses valid methods for testing the significance of gain scores of individual
students.

Keywords: Academically Adrift, change scores, critical thinking

David Lane, Rice University, Houston, TX 77005-1892; Lane@rice.edu. Frederick L. Oswald, Rice University, Houston, TX 77005-1892; foswald@rice.edu.

Since 2011, the educational research literature, the popular press, and even Bill Gates's blog¹ have expressed disappointment and shock at the research finding that 45% of college students fail to show statistically significant gains in critical thinking. Clearly, this is a very high percentage that would appear to imply an urgent and critical need to reform college teaching, the academic curriculum, and the college environment in general. This dismal finding was reported in the book Academically Adrift: Limited Learning on College Campuses, where the authors, Richard Arum and Josipa Roksa (2011a), conclude that "we observe no statistically significant gains in critical thinking, complex reasoning, and writing skills for at least 45 percent of the [college] students in our study" (p. 36). Similarly, in a related article on the topic, they repeat that "[f]orty-five percent of students did not demonstrate any significant improvement in learning, as measured by CLA performance, during their first 2 years of college" (Arum & Roksa, 2011b, p. 204). Pascarella, Blaich, Martin, and Hanson (2011), using the same statistical method with new data, obtained comparable results. Astin (2011) correctly predicted that "this 45-percent conclusion is well on its way to becoming part of the folklore about American higher education," and time has borne this prediction out.

This conclusion in Academically Adrift comes from the statistical analysis of data on the Collegiate Learning Assessment (CLA), viewed as a measure of critical thinking. The CLA was administered to college students at the beginning of their freshman year and again at the end of their sophomore year, with the difference in CLA scores being interpreted as change in students' critical thinking over these 2 years. Arum and Roksa's conclusion that 45% of college students did not increase their critical thinking skills was astutely criticized first by Astin (2011), who argued that (1) the authors focused on statistically significant gains, but not whether these gains were practically meaningful, and (2) the measured differences were unreliable, meaning that if any important gains in learning existed, they would be very hard to detect. Initially, we felt that Astin addressed these issues sufficiently—but upon closer examination, we discovered that the statistical tests conducted by Arum and Roksa are fundamentally incorrect. Given the frequent reference to the 45% finding, we believe this error merits a brief explanation and a correction to the scientific record.

The key question here is how to determine whether a given student's measured change in critical thinking (on the CLA) is statistically significant. The approach taken by Arum and Roksa (2011a) and repeated by Pascarella et al. (2011) is analogous to a repeated-measures t-test of differences between means. Take a variable $X$ measured at two points in time—let's call these measures $X_1$ and $X_2$. In a repeated-measures t-test, the numerator contains $(M_2 - M_1)$, which reflects the mean change between these measures; and the denominator contains its corresponding standard error, which is equal to

$$s_{M_d} = \frac{s_d}{\sqrt{N}},$$

where $s_{M_d}$ is the estimated standard error of the difference between means, $s_d$ is the standard deviation of the difference scores $(X_1 - X_2)$, and $N$ is the number of observations. The repeated-measures t-test is therefore

$$t = \frac{M_2 - M_1}{s_{M_d}}.$$
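To make the group-level computation concrete for what follows, here is a minimal sketch in Python; the function name and the simulated pre/post scores are our own illustration, not data or code from the original study:

```python
import numpy as np
from scipy import stats

def repeated_measures_t(x1, x2):
    """Paired (repeated-measures) t-test: mean change divided by its standard error."""
    d = np.asarray(x2) - np.asarray(x1)       # difference scores, one per person
    n = len(d)
    m_d = d.mean()                            # M2 - M1, the mean change
    s_md = d.std(ddof=1) / np.sqrt(n)         # standard error of the mean difference
    t = m_d / s_md
    p = 2 * stats.t.sf(abs(t), df=n - 1)      # two-tailed p-value
    return t, p

# Made-up pre/post scores: a modest average gain with noisy individual changes.
rng = np.random.default_rng(0)
pre = rng.normal(1100, 150, size=200)
post = pre + rng.normal(30, 120, size=200)
print(repeated_measures_t(pre, post))
```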
Arum and Roksa (2011a) developed a significance test applied to each student where the numerator contains $(T_2 - T_1)$, which is the difference between his/her Time 2 ($T_2$) and Time 1 ($T_1$) score, and the denominator contains the standard error of the difference between means. In other words, as we confirmed in communication with Richard Arum (Arum, 2012), their statistical test of the gain for each student $i$ on the CLA was computed as

$$t_i = \frac{T_{2i} - T_{1i}}{s_{M_d}},$$

which is incorrect.
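Continuing the sketch above, the flawed per-student statistic amounts to a one-line change (again our own illustration, not code from the original analysis):

```python
# The flawed test described above: each student's individual gain is divided
# by the GROUP-level standard error of the mean difference, s_Md.
d = post - pre                             # every student's individual gain
s_md = d.std(ddof=1) / np.sqrt(len(d))     # group-level standard error
t_flawed = d / s_md                        # one (inflated) "t" per student
```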
This test is faulty for two reasons: First, the denominator here is the standard error of the difference between means, when it should be the standard error for the individual student. There is no logical or statistical reason why the assessment of an individual student's improvement (in the numerator) should depend on the variability of improvements across students (in the denominator).

Second, and important to the 45% finding of Academically Adrift, a difference between means is much more stable than a difference between the test scores of any individual student. Because the error in the denominator of the formula above is much smaller than it would be for an individual change score, the resulting erroneous t statistic produces too many statistically significant findings (i.e., this incorrect t statistic produces values that are too high, with corresponding p-values that are too low).

Here is another way to think about this error in computation. If one is ultimately interested in the change in each student's critical thinking skills, then the number of students participating in the study should not influence the statistical precision of each student's change in critical thinking—but with Arum and Roksa's analysis, it does.² For example, if 100,000 students had been tested on the CLA, then the same analysis by Arum and Roksa would have produced a radically different substantive conclusion: all or nearly all of the students would show significant changes in critical thinking—even though nothing has changed except the number of students tested. Using appropriate statistical methods, such as the ones recommended by Christensen and Mendoza (1986) or Hageman and Arrindell (1993), described below, a sample of 100,000 would not materially affect the conclusion about the change in each student's CLA score across the two time points.
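This sample-size dependence is easy to demonstrate numerically. In the minimal sketch below (the gain and spread values are invented for illustration), the same 40-point individual gain yields an ever-larger "significant" t under the flawed test simply because N grows:

```python
import numpy as np

# One student gains 40 points; the spread of gains across students is fixed.
gain_i, sd_of_gains = 40.0, 120.0

for n in (100, 2000, 100_000):
    s_md = sd_of_gains / np.sqrt(n)   # group-level standard error shrinks with N
    print(n, gain_i / s_md)           # flawed per-student "t" grows without bound
# The student's actual gain, and its individual-level uncertainty, never changed.
```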
The question, then, is how does one determine whether a given change in critical thinking is strongly supported by the data, or whether it could have occurred by chance? Christensen and Mendoza (1986) suggested a method based on comparing a student's change score against an error term computed from the standard error of measurement (SEM). They called this error $S_{\text{diff}}$, and it is computed as

$$S_{\text{diff}} = \sqrt{2}\,\text{SEM}.$$

The formula for SEM is

$$\text{SEM} = \sqrt{S^2(1 - r_{xx})},$$

where $S^2$ is the variance of the test and $r_{xx}$ is the reliability of the test. Naturally, the reliability must be known or estimated in order to compute SEM.

A statistical significance test for the change in an individual student $i$ can then be computed as

$$z_i = \frac{T_{2i} - T_{1i}}{S_{\text{diff}}},$$

using the normal distribution to compute probabilities.
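Here is a minimal sketch of this Christensen and Mendoza (1986)-style computation, assuming a known test standard deviation and reliability; the numbers are illustrative, not CLA values:

```python
import numpy as np
from scipy import stats

def reliable_change_z(t1, t2, sd, rxx):
    """Test an individual change score against measurement error alone."""
    sem = sd * np.sqrt(1 - rxx)           # standard error of measurement
    s_diff = np.sqrt(2) * sem             # error of a difference of two such scores
    z = (np.asarray(t2) - np.asarray(t1)) / s_diff
    p = 2 * stats.norm.sf(np.abs(z))      # two-tailed p from the normal distribution
    return z, p

# Illustrative values: test SD of 150, reliability .80, a student gaining 60 points.
z, p = reliable_change_z(1100, 1160, sd=150.0, rxx=0.80)
print(z, p)   # z is about 0.63: a 60-point gain is well within measurement error here
```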
A method also based on the SEM, but one that measures change more precisely because it estimates true scores as part of the computations, was proposed by Hageman and Arrindell (1993). We do not include Hageman and Arrindell's formulas here because they are rather lengthy; however, the reader will find them clearly explicated in their article. There may be other useful methods as well. The general point here is that the interpretation of CLA gain scores was based on an erroneous formula, and other reasonable computations are available.

To reiterate, testing the change in students' critical thinking should have been based on estimates of the variation of individual student changes in the CLA measure, with these estimates computed from the SEM. The test should not have been based on an estimate of the variation of the mean change. Based on the CLA data and assuming correct statistical tests, it would actually have been surprising if as many as 55% of students (i.e., 100% − 45%) had demonstrated statistically significant gains. Instead, one would have expected to find many fewer significant gains, because the error associated with the CLA difference computed for each student would be expected to be relatively large—certainly much larger than the group-level error that was (mistakenly) used by Arum and Roksa (2011a).

As Astin (2011) also pointed out, merely deciding that a student's improvement was statistically different from 0 is not very informative; it is also critically important to know whether the improvement is practically significant. Therefore, we suggest that researchers specify an improvement score that represents a practically significant effect and compare improvements against that value. One straightforward way to do this is through 95% confidence intervals on the improvement score for each student, to see whether the interval is (a) above this value, indicating the improvement is practically significant; (b) below this value, indicating the improvement is not practically significant; or (c) straddling this value, indicating the practical significance of the improvement is undetermined. This approach is identical to conducting two one-tailed t-tests (each at the .025 level) of whether the observed difference is greater than or less than a practical significance benchmark (Kaiser, 1960; Schuirmann, 1987). If one is only interested in determining the proportion who improve, then a one-sided 95% confidence interval or, equivalently, a single one-tailed t-test at the .05 level can be computed. Note that either the Christensen and Mendoza (1986) or the Hageman and Arrindell (1993) approach can be used as a basis for these confidence intervals and significance tests.
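The confidence-interval classification might be sketched as follows, reusing the $S_{\text{diff}}$ error term from above; the benchmark and score values are hypothetical:

```python
def classify_gain(t1, t2, s_diff, benchmark, z_crit=1.96):
    """Classify a student's gain against a practical-significance benchmark
    using a 95% confidence interval on the observed change."""
    gain = t2 - t1
    lo, hi = gain - z_crit * s_diff, gain + z_crit * s_diff
    if lo > benchmark:
        return "practically significant"       # whole interval above the benchmark
    if hi < benchmark:
        return "not practically significant"   # whole interval below the benchmark
    return "undetermined"                      # interval straddles the benchmark

# Hypothetical: a benchmark gain of 50 points and s_diff of about 95 points.
print(classify_gain(1100, 1400, s_diff=95.0, benchmark=50.0))  # practically significant
```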
Perhaps the gist of Arum and Roksa's (2011a) conclusion is not entirely dependent on their exact 45% figure, and perhaps students are not learning critical thinking skills as much as we think or as much as we would like as educators. However, addressing this complexity critically depends on appropriate statistical testing, additional judgments about what are practically meaningful gains in critical thinking, and further investigation that compares the CLA to alternative measures of critical thinking. Our central point in this commentary is that this 45% figure has garnered an incredible amount of attention from academic journals and popular press venues—yet as we have explained, this estimate was based on a statistical miscalculation, is fundamentally incorrect, and merits a correction to the scientific record.

Notes

¹ Entry from October 20, 2012: http://www.gatesnotes.com/Books/Academically-Adrift

² This general argument stands, but note that individual estimates of longitudinal change in a multilevel analysis would, in fact, be influenced by others in the sample (e.g., outlying trajectories would "shrink" toward the overall trend somewhat).

References

Arum, R. (2012). Personal communication, May 6, 2012.

Arum, R., & Roksa, J. (2011a). Academically adrift: Limited learning on college campuses. Chicago: University of Chicago Press.

Arum, R., & Roksa, J. (2011b). Limited learning on college campuses. Society, 48, 203–207.

Astin, A. W. (2011). In Academically Adrift, data don't back up sweeping claim. Chronicle of Higher Education. Retrieved July 27, 2016, from http://chronicle.com/article/Academically-Adrift-a/126371/

Christensen, L., & Mendoza, J. L. (1986). A method of assessing change in a single subject: An alteration of the RC index. Behavior Therapy, 17, 305–308.

Hageman, W. L., & Arrindell, W. A. (1993). A further refinement of the reliable change index by improving the pre–post difference score: Introducing the RCID. Behaviour Research and Therapy, 31, 693–700.

Kaiser, H. F. (1960). Directional statistical decisions. Psychological Review, 67, 160–167.

Pascarella, E. T., Blaich, C., Martin, G. L., & Hanson, J. M. (2011). How robust are the findings of Academically Adrift? Change: The Magazine of Higher Learning, 43, 20–24.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.
