Assessment of University Students' Critical Thinking: Next Generation Performance Assessment

International Journal of Testing

ISSN: 1530-5058 (Print) 1532-7574 (Online) Journal homepage: https://www.tandfonline.com/loi/hijt20

Assessment of University Students' Critical Thinking: Next Generation Performance Assessment

Richard J. Shavelson, Olga Zlatkin-Troitschanskaia, Klaus Beck, Susanne Schmidt & Julian P. Mariño

To cite this article: Richard J. Shavelson, Olga Zlatkin-Troitschanskaia, Klaus Beck, Susanne Schmidt & Julian P. Mariño (2019): Assessment of University Students' Critical Thinking: Next Generation Performance Assessment, International Journal of Testing, DOI: 10.1080/15305058.2018.1543309

To link to this article: https://doi.org/10.1080/15305058.2018.1543309

Published online: 24 Jan 2019.

International Journal of Testing, 2019
Copyright © International Test Commission
ISSN: 1530-5058 print / 1532-7574 online
DOI: 10.1080/15305058.2018.1543309

Assessment of University Students' Critical Thinking: Next Generation Performance Assessment
Richard J. Shavelson
Graduate School of Education, Stanford University, USA

Olga Zlatkin-Troitschanskaia, Klaus Beck and Susanne Schmidt
Gutenberg School of Management & Economics, Johannes Gutenberg University Mainz, Germany

Julian P. Mariño
Universidad de los Andes, Facultad de Ciencias, Colombia

Following employers' criticisms and recent societal developments, policymakers and educators have called for students to develop a range of generic skills such as critical thinking ("twenty-first century skills"). So far, such skills have typically been assessed by student self-reports or with multiple-choice tests. An alternative approach is criterion-sampling measurement. This approach leads to developing performance assessments using "criterion" tasks, which are drawn from real-world situations in which students are being educated, both within and across academic or professional domains. One current project, iPAL (the international Performance Assessment of Learning), consolidates previous research and focuses on next-generation performance assessments. In this paper, we present iPAL's assessment framework and show how it guides the development of such performance assessments, exemplify these assessments with a concrete task, and provide preliminary evidence of its reliability and validity, which allows us to draw initial implications for further test design and development.

Keywords: critical thinking, criterion-sampling measurement, evidence-centered design, generic skills, performance assessment, performance task

Correspondence should be sent to Olga Zlatkin-Troitschanskaia, Gutenberg School of Management & Economics, Johannes Gutenberg University Mainz, Jakob Welder-Weg 9, Mainz, 55099 Germany. E-mail: troitschanskaia@uni-mainz.de

INTRODUCTION

The rapid expansion and diversification of higher education worldwide over the past 50 years across providers, students, and countries has given rise to concerns about "quality" and student learning outcomes (e.g., Shavelson, 2010). One such concern is whether college graduates have acquired skills such as critical thinking, problem solving, perspective taking, and communicating—often referred to as twenty-first century skills—needed to act not only as professionals in their chosen fields but also as informed citizens in this complex world.
This concern has given rise to a call for assessing these twenty-first century
skills in higher education (e.g., Tremblay, Lalancette, & Roseveare, 2012, p. 62).
It has been met with a variety of approaches ranging from self-report of learning
surveys to multiple-choice tests to performance assessments of critical thinking,
problem solving, and the like (for an overview, see Zlatkin-Troitschanskaia,
Shavelson, & Pant, 2018). The OECD’s Assessment of Higher Education
Learning Outcomes (AHELO) feasibility study (Tremblay et al., 2012), for
example, focused on both multiple-choice and performance assessment.
In the course of building assessments of these skills, the question has arisen
as to whether they are general (“generic”) skills that are applicable to all walks
of life or domain-specific skills applicable, for example, to civil engineering—
thinking critically, taking alternative stakeholder perspectives, and communi-
cating clearly.
The aim of this article is to describe and exemplify the work of the
international Performance Assessment of Learning (iPAL) consortium on the
measurement of generic twenty-first century skills across domains, with
a particular focus on critical thinking. iPAL grew out of two shortcomings of
the AHELO’s performance assessment that used tasks from the Collegiate
Learning Assessment (CLA, e.g., Shavelson, 2010). First, the group of experts
advising AHELO noted the absence of an assessment framework for the
performance assessment (OECD, 2013b). Second, given the limited prepar-
ation time for the study, only performance tasks from the CLA were used.
These tasks “proved excessively ‘American’ in an international context.”
(OECD, 2013a, p. 169). Tasks should be developed from multiple national
contexts and vetted for their applicability across participating nations and con-
texts (for the unintended consequences of exporting assessments, see, e.g.,
Oliveri, Lawless, & Young, 2015).
The iPAL collaborative includes representatives from Europe, the Americas,
and Asia (Shavelson, Zlatkin-Troitschanskaia, & Mariño, 2018). It developed a
general construct definition for twenty-first century skills (Shavelson et al., 2018),
and a test framework using evidence-centered design (ECD) (Mislevy & Haertel,
2007) and a performance assessment framework (Shavelson, 2013) based on a criterion-sampling measurement approach. The aim is to create an assessment that
focuses on generic “twenty-first century skills” across domains and to incorporate
new research on rational thought that goes beyond the current item formats, for
example, video and spreadsheets, and that produces reliable scores for individual
test-takers. This vision provided a conceptual and theoretical foundation for creat-
ing iPAL’s performance assessments.
In this article, we focus on performance assessment of one such twenty-first
century skill, critical thinking. In what follows, we describe the framework
and exemplify its application to the development and evaluation of a particular
newly developed performance assessment—“Wind Turbine.” We report
findings from this performance assessment from a pilot test with 30 German
undergraduate and graduate students. The results presented here offer
preliminary evidence on reliability and validity and allowed us to draw initial
implications for further test design and development.

THEORETICAL BACKGROUND ON PERFORMANCE ASSESSMENT

The notion of a performance assessment seems quite straightforward.


A performance assessment comprises a collection of constructed and selected
response tasks and items aimed at measuring an individual’s (or institution’s)
performance on particular skills such as critical thinking and perspective tak-
ing. The performance tasks are high-fidelity simulations of actual real-world
decision- or interpretation-situations found daily, for example, in newspapers
(Shavelson, 2013). Our work is highly influenced by McClelland (1973).
Paraphrasing his argument, he might have said: if you want to know whether a person
can think critically and make a reasoned decision about, for example, health
care, don’t give her or him an intelligence test. Rather, see what a person does
in actuality when confronted with the decision (see also Feinstein, 2014).
Even more concretely, McClelland (1973, p. 7) admonished us to measure per-
formance in the situation for which most intelligence and aptitude tests are
built to predict—criterion performance. For example, McClelland says that if
you want to know if a person can drive a car (criterion situation), sample his
or her ability to drive the car; do not use a multiple-choice test. Such a test,
however, would be appropriate for measuring knowledge of laws govern-
ing driving.
Based on McClelland’s approach, Shavelson, Beckum, and Brown (1974),
for example, developed a performance assessment to select applicants for
entry into the San Francisco Police Department. They conducted a thorough
analysis of the patrolman’s job and observed and analyzed in detail what
patrolmen did in performing the job. They specified the skills demanded and
the universe of situations in which the skills were enacted. Taken together, the
job analysis represented both the universe of tasks performed by patrolmen
and the universe of situations in which they were performed. Situations
sampled from the job universe were built as high fidelity simulations of the
tasks patrolmen actually carry out (for additional examples of performance
tasks, see Seminara, Shavelson, & Parsons, 1967; for pre-college education in particular, Lane & Stone, 2006; Shavelson, Baxter, & Pine, 1991; NAEP,
2009, p. 107; see also Fu, Raizen, & Shavelson, 2009).
Performance assessment typically comprises a combination of selected-
response (“respondent”) and constructed-response (“operant”) items and tasks.
To qualify as a performance assessment, however, there has to be at least one
concrete high-fidelity simulation of a criterion situation (and preferably more).
The task(s) should be sampled from a specified universe of tasks comprising
the criterion situation. Think of all the possible situations you might want to
observe a novice automobile driver navigate; better yet go out and observe
him/her. Then, sample the situations. From the driver’s performance on the
sample of tasks, infer her/his performance in the universe of driving situations
(with some degree of error).
We now turn to a description of the iPAL performance assessment frame-
work. Our focus is on performance tasks and not selected-response tasks
because there is plenty of textbook material on the latter (e.g., Secolsky &
Denison, 2018). More specifically, we focus on “test domain” analysis and
modeling, task and response sampling, scoring, and other matters (e.g., imple-
mentation) (for more details, see Shavelson et al., 2018).

iPAL PERFORMANCE ASSESSMENT FRAMEWORK

The overarching constructs underlying iPAL are students'—and, more generally, citizens'—capacities, when confronted with complex everyday life situations, to think critically (for example, about the environment), take the perspective of others (say, in a business situation), and communicate their ideas, beliefs, analyses, and decisions precisely. Typically, a real-world event or "problem to be solved" is presented in brief story format, accompanied by information that is more or less reliable and relevant to the event or problem (see the example in the next section). The story might require reasoning about a claim that admitting migrants
into a country raises the crime rate, with students combining several data sour-
ces and doing some basic calculation to generate reliable and useful informa-
tion to address the claim. In another case, the problem might tap thinking
critically about the message underlying an art exhibition that portrays the ten-
sion between engineering’s contribution to progress and its negative environ-
mental impact in pictures, sculptures, and literature. At other times, critical
thinking might be tapped in deciding on a course of action when varying sides


to a proposed civic project—for example, where to situate a prominent movie
mogul’s museum (if at all)—are aired.

Construct Definition: Critical Thinking


Critical thinking is conceived as the process of conceptualizing, analyzing or synthesizing, evaluating, and applying information to solve a problem, decide on a course of action, find an answer to a given question, or reach a conclusion. It comprises facets such as evaluating claims, analyzing inferences, weighing decisions, and analyzing problems (e.g., Kosslyn & Nelson, 2017).
Wheeler and Haertel (1993) conceptualized higher-order thinking skills such as critical thinking by distinguishing two types of context in which these skills are employed:

a. contexts in which thought processes are needed for solving problems and
making decisions in everyday life, and
b. contexts in which mental processes can be applied that must be developed
by formal instruction, including processes such as comparing, evaluating
and justifying.

Therefore, the universe of tasks demanding generic critical thinking comprises the myriad everyday complex life situations. A prime source of situa-
tions may be found readily in, for example, newspaper sections (e.g., politics,
environment, education, business, and science).

Facets of Critical Thinking. Critical thinking is evoked by presenting a situation in the form of a "story" or an "event." To evoke critical thinking, the story and accompanying materials encompass a number of different facets. While there will be disagreement as to which facets are important for assessing university students' learning to think critically, for iPAL, at least initially, tasks evoking critical thinking employ four facets:

1. trustworthiness of the information—reliable, unreliable, uncertain;
2. relevance of the information—relevant as it pertains to the problem at hand or irrelevant as it is not related to the problem;
3. proneness to judgmental/decision bias—information plays to fast-thinking judgmental errors and well-known biases; and
4. response to the story problem—reach a judgment, reach a decision, recommend a course of action, suggest a problem solution.

Task Universe
The universe of tasks demanding generic critical thinking comprises the myr-
iad everyday complex life situations. iPAL samples such situations for inclu-
sion in performance tasks and more traditional items (e.g., multiple-choice). A
prime source of situations may be found easily in mass media (e.g., politics,
environment, business, and science). The Airplane task developed for the CLA
(Shavelson, 2010), for example, was inspired by the report of an aircraft crash
at the Van Nuys Airport in Southern California.
Performance tasks are complex, often without a clear path toward a solution, decision, or action; rather, there are tradeoffs. They admit of more than one feasible solution; when incorporated into an assessment, they have better and worse solutions, decisions, actions, and so on. The tasks are compelling in the sense that
they represent current everyday challenges that test-takers face or might be
expected to face as college graduates and, more generally, as citizens.
Once a construct such as critical thinking is chosen and a domain of life activ-
ities is selected, a search for possible assessment stories and specific tasks ensues
(an internet search is invaluable). Once chosen, the following is carried out:

1. An assessment story is built. The story is short and motivates the performance assessment activities.
2. Assessment tasks are developed to include certain elements that invite
test-takers to think critically—trustworthiness, relevance, proneness.
3. A response is requested that involves bringing evidence to bear from the
information given on the problem, activity or situation in order to justify a
decision, recommendation, course of action, etc.

Information-Source Sampling
Materials such as newspaper articles, YouTube videos, and government reports are sampled from real-world domains and constructed to vary the information in the event. The information provided may be manipulated as to its:

a. Trustworthiness—trustworthy, such as the Federal Aviation Report in the Airplane task, or not trustworthy, such as an amateur aviator's newspaper opinion article;
b. Relevance—directly relevant to the issue at hand (FAA Report) or irrelevant if tangential or unrelated to the task (photos of the SwiftAir 135 and 235); and
c. Proneness to bias or errors in judgment and decision making—judgmental heuristics and biases may lead to predictable errors (e.g., mistaking correlation for causality when thinking too quickly) or stereotypical thinking (the "representativeness heuristic"—e.g., that engineers wear pen pocket protectors—or racial bias).

Judgmental and Decision Heuristic and Bias Sampling


In using information to make judgments and decisions, people often take
shortcuts or use heuristics to make judgments or reach a decision. The work
of Tversky and Kahneman (1974) opened up a field that has become known
as rational thought (e.g., Kahneman, 2011; Stanovich, 2009). These heuristics
are normally applicable in the real world, where quick judgments or decisions
must be made and where deliberative thought might be dangerous (e.g., get
out of the crosswalk because the car isn’t going to stop). However, they can
interfere with rationality—critical thinking or problem solving—when the situ-
ation is important enough to demand a rational decision (e.g., buying a house).
In this case, deliberative thought is needed to simulate alternatives and their
consequences before judging or deciding.
Since Tversky and Kahneman's initial research, the list of judgmental and
decision making heuristics has exploded (e.g., Stanovich, 2016) and can be
easily researched on the internet. Consequently, irrational (when the situation
demands otherwise) thinking heuristics are built into performance tasks or
might be assessed in stand-alone multiple-choice questions.
The iPAL tasks use, for example, the kind of heuristics where unadjusted
data lead to a problematic decision if baseline conditions are ignored. There
are many other heuristics that can be incorporated into assessment tasks that
simulate, with high fidelity, everyday events. Moreover, the aim is to create a
separate selected-response portion of the iPAL that probes students’ ability to
resist “fast thinking” and slow down to “simulate” alternative courses of action
and their alternatives.

Response Considerations
The result of critical thinking is typically a problem solution, a decision, a rec-
ommended course of action, a judgment or direct action. In all cases, two ele-
ments are required:

1. The problem solution (etc.) must be justified with the information avail-
able in the assessment. That is, a strong response would:
• use trustworthy information and avoid less-than-trustworthy information,
• use relevant information and avoid peripheral information,
• avoid judgmental and decision-making "traps" and biases, and
• consider alternative courses of action to the one proposed and indicate why the recommendation is given.
2. The response given should be a high-fidelity simulation of the kind of
response that would be given in the real world. The ability to communi-
cate clearly, concisely, accurately, and compellingly is part of our concep-
tion of critical thinking—its “output.” The communication might be in
writing (e.g., a memo to the president of a company or an op-ed piece),
visually, orally with visuals (e.g., PowerPoint presentation with notes), or
other. Such communication would use concise compelling arguments from
the evidence provided to conclusions to rhetorically establish a position,
decision, course of action, or recommendation.

Task Format and Delivery


iPAL tasks are delivered on a computer platform. Computers provide substan-
tial leeway both in delivering tasks and in their fidelity to the real world that
they are intended to emulate. The task format decision is driven first and fore-
most by its fidelity to the criterion situation being simulated. This said, cost
and safety are also important considerations and they, too, must be incorpo-
rated into the format selection.
Multiple formats are used. Some formats are open-ended: Students con-
struct answers of varying length, for example, in response to a prompt inviting
them to make a judgment or decision. At least one sub-task will stimulate text
production of sufficient length to evaluate students’ writing as to:

1. the evidence presented from provided information to justify a decision or recommend a course of action, and
2. the clarity and force of the argument presented.

Moreover, a spreadsheet might be used for calculations, simulations might be used for modeling alternatives, and PowerPoint might be used for presentation
and justification of recommendations. An intranet containing trustworthy and
unreliable, relevant or irrelevant documents and so on might be used to examine
students’ capacity to search and bring evidence to handle a problem. Audio
might be used to enhance the fidelity of the simulated situation or to collect stu-
dents' verbal presentations (e.g., Kuhn, Zlatkin-Troitschanskaia, Brückner, &
Saas, 2018). In the final analysis, the technology is subservient to the measured
construct, not vice versa. However, the technology provides a means of increas-
ing simulation fidelity beyond what is possible with pencil and paper.
Selected-response (e.g., multiple-choice) formats are used, for instance, to
probe critical reading of documents provided in the task, quantitative reasoning
with graphs or tables provided, or rational thinking with standalone prompts (Stanovich, 2016). Other formats can be brief, self-contained tasks with either
short constructed responses or multiple-choice questions.

Relation to Evidence-Centered Design


From an ECD perspective, the claim is that the assessment task presented here taps
critical thinking on everyday complex issues, events, problems, and the like. The evi-
dence comes from evaluating test-takers’ responses to the assessment tasks and
potential accompanying analyses of response processes such as think-aloud inter-
views or log file analyses (Ercikan & Pellegrino, 2017; AERA, APA, & NCME,
2014, p. 15). The tasks in the assessment manipulate the trustworthiness, relevance,
and judgmental aspects of the information that confront a person when thinking crit-
ically about important events.

iPAL Performance Assessment Development


The development of performance assessments is as much an art as it is a science
and probably more art. Considerable creativity and artistry is needed to translate
a real-world event such as an aircraft accident into an assessment of critical think-
ing. In a real sense creating a performance assessment is much like creating
a case to be used in a course on business, architecture, medicine, or law.
The assessment should reflect the reality of the criterion situation, it should be
engaging, it should be problematic enough to elicit the intended construct (e.g.,
critical thinking) and it should offer the opportunity for multiple solution paths
each of which is more or less justifiable from the information provided.
However, ingenuity and creativity can be overdone. The assessment
may become so creative that it does not focus clearly on the construct to be measured or the claim to be made, namely, that the assessment taps critical think-
ing. Consequently, science underlies the construction of a reliable, valid and
useful performance assessment. Following evidence-centered design (Mislevy
& Haertel, 2007):

• The Construct or Claim—a clear statement is required of what construct and what facets of the construct are to be measured. For example, in measuring critical thinking the claim is that scores on the measure tap critical thinking. Consequently, information should be provided that varies in trustworthiness and relevance. Some of the evidence should invite "fast thinking" and judgmental or decision-making errors. The response should be to recommend a course of action, make a decision, or make a recommendation that is based on trustworthy and relevant information, avoiding judgmental traps and biases.
• The Story—the case presented sets up the problem or challenge. It sets constraints on what is at issue and what information is to be used. It should be compelling and motivational.
• Evidence—the information provided on the assessment should map directly
back to the elements of the construct or the claim—trustworthiness, etc.
• Tasks—the set of tasks or items including their response formats found
in the assessment should refer back to the evidence they are intended to
provide. If multifaceted, the task doer or reader should be able to work
backward from the task(s) to the elements/pieces of information collected
as evidence bearing on the claim.
• Interpretation—the analysis has two (probably more) steps:
– The method of scoring task performance should be related back to the tasks in a manner consistent with the evidence and the claim to be made from the assessment. In the case of critical thinking, the trustworthiness and relevance of the information presented provide a basis for scoring, as does the opportunity for judgmental errors or biases.
– The scores should be modeled statistically and/or qualitatively in such a
way as to bring them to bear on the technical quality or interpretability or
claims of the assessment: reliability, validity, and utility evidence.

PILOT STUDY OF “WIND TURBINES”

To measure critical thinking and evaluate the measurement, we used a new iPAL-developed computer-based assessment, Wind Turbines (WT). Evidence of its reliability and validity was garnered from (1) student performance scores, (2) a (semi-)structured cognitive interview bearing on the cognitive validity of score interpretation, and (3) a standardized questionnaire with selected-response questions measuring potential personal and contextual factors influencing test performance.

Task Construct
The PAL task “Wind Turbine” (WT) presented here is designed to measure
critical thinking in higher education students or graduates across domains of
study in four central facets:

1. Evaluating and using information as to trustworthiness, relevance, and judgmental error or bias proneness of sources.

2. Recognizing, evaluating, integrating, and structuring arguments and their components (such as claims, support, beliefs, assumptions, or facts) in response.
3. Recognizing and evaluating consequences of decision-making and actions.
4. Taking communicative action appropriate to deliver results in line with the
task prompt, i.e., making an evaluative judgment, explaining a decision,
recommending a course of action, suggesting a problem solution, etc. (see
also Shavelson et al., 2018).

“Wind Turbine” Performance Assessment


WT was constructed from authentic alternative energy source cases with
meaningful consequences for myriad actors depending on the decisions and
actions taken (e.g., Shavelson, Davey, Holland, Webb, & Wise, 2015). It
focuses on a decision a small Town Council has to take regarding whether to
acquire and set up wind turbines on communal land owned by some town
inhabitants. Test-takers are given a document library and asked to evaluate
available information (e.g., newspaper article, web document, wind turbine
schematic, stakeholder interests as well as selected technical, economic, terri-
torial law, settlement, and wildlife data) that varies in trustworthiness, rele-
vance and opportunity for bias or judgmental error and write an argumentative
statement to recommend a course of action. Students are asked to use only the
information provided, informed of the main scoring criteria and told that there
is no right or wrong answer but that answers can vary in their justifiability.
In addition to judging the trustworthiness and relevance of the different
data sources, test-takers need to derive arguments for or against wind turbine
acquisition and weigh and prioritize them, while taking into account possible
bias or motives for hidden agendas, such as personal profit, and consequences
for the community or individual members. Task difficulty and scoring are
fine-tuned by the nature of the information presented (trustworthiness, rele-
vance, bias/error), the number of information sources and the various points to
consider (viz. evaluation of alternative actions, additional information needed)
in making a recommendation.

Claim. WT assesses critical thinking. Validity evidence is gathered by (1) evaluating 30 test-takers' responses to the task and their scored responses (see Table 1), with additional evidence on the relationship of scores to responses in (2) cognitive interviews, as well as (3) through a questionnaire with selected-response questions.

TABLE 1
Descriptive Statistics and Reliability of Average Scores on the 6-Point Likert Scale in the Performance Task (N = 30)

Variable (total possible score)                           Mean   SD     Min    Max    Internal consistency (alpha)
Overall (138)                                             3.55   0.76   1.39   4.65   0.95
Rubric 1: Recognizing and evaluating the relevance (24)   4.05   0.70   2.13   4.88   0.64
Rubric 2: Evaluating and decision making (54)             3.52   0.83   1.23   5.06   0.90
Rubric 3: Recognizing and evaluating consequences (24)    2.68   0.84   1.00   4.63   0.82
Rubric 4: Writing effectiveness (36)                      3.84   0.94   1.33   5.25   0.91

Story. The test-taker takes the role of a student representative on the City of Elmsen's Council, along with six other members with varying backgrounds and interests. The mayor has asked the student to evaluate Ventusa's offer to build a park with six wind turbines. The wind park is in principle licensable, and its use for the production of wind energy is generally desirable. Ventusa prefers the Elmsen location because of its high wind-power density but, if Elmsen declines, would consider building in the neighboring municipality of Murn.

Evidence. The test-taker has a library of 22 documents that vary in trustworthiness, relevance, and proneness to judgmental error/bias. The sources take one or more of the following forms: table, chart, figure, video, text, document excerpt, and hyperlink to Wikipedia (see Table A1 in the appendix).

Task. As a member of the municipal council, the test-taker has been asked by the mayor to prepare a statement for the next meeting, which will: (1) compile the main arguments for and against the adoption of the Ventusa proposal; (2) formulate a reasoned decision recommendation, supported with evidence from the documents; and (3) suggest one or two pieces of additional information that would increase confidence in the recommendation.

Interpretation. The scores on the WT measure are interpreted as gauging the level of test-takers' critical thinking. Evidence bearing on reliability and validity (cognitive, correlational) is used to examine the proposed interpretation.

Test Administration. The test-takers completed the task on a computer and generally needed about 60 minutes to solve it (see Zlatkin-Troitschanskaia et al., 2018). (Test-takers' approaches varied, from investigating all links to external information to taking into account only the information in the story and not the documents.) Upon completion of the written task, test-takers participated in a cognitive interview and, after a short break, responded to a short questionnaire.

Sample
A total of 30 students from a German university participated in the pilot study.
Twenty-five participants were enrolled for a bachelor’s degree; five were
enrolled for a master's degree. Participants ranged in age from 19 to 37 years, with an average age of 24 years (SD = 4.03 years). Since prior education provides an indicator of
general cognitive skills (e.g., Kim & Lalancette, 2013; Schaap, Schmidt, &
Verkoeijen, 2011), we characterized test-takers by their prior education or
vocational training: 6 participants had achieved a higher education entrance
qualification at a specialized secondary school with a focus on economics; 11
had already completed commercial vocational training. Finally, 2 participants
mentioned political commitment in their community, and 3 mentioned other
social commitments, which could constitute relevant experience in the context
of solving problems such as those found in WT (Table A2 in appendix shows
the descriptive statistics of the sample).

Scoring
Analytic dimensional scoring rubrics were developed based on the construct
definition of critical thinking and the classification of the different information
sources and arguments according to relevance, trustworthiness, and judgmental
heuristics/bias. The rubrics flexibly took into account test-takers’ specific use
of information varying in trustworthiness and relevance as well as their reflec-
tion on and use or avoidance of heuristics that can facilitate or lead to errors
in judgment and decision making. The rubrics code the use of such informa-
tion in evaluating test-takers’ justifications for their recommendations for
action, evaluation of alternative courses of action and identification of add-
itional information needed. The trained raters score performance along a set of
23 correlated dimensions that are clustered into four rubrics as follows:

• Rubric 1: Recognizing and evaluating the relevance, trustworthiness, and validity of given information, with four subdimensions
• Rubric 2: Evaluating and making a decision, with nine subdimensions
• Rubric 3: Recognizing and evaluating consequences of decision-making and actions, with four subdimensions
• Rubric 4: Writing effectiveness, with six subdimensions

Performance on each scale was scored on a 6-point, Likert-type anchored scheme. The scheme divided students' response texts into four dimensions with four to nine items nested in each (23 items in total) (see an example for the first rubric, including scale anchors, in Table A5 in the appendix). Test-takers' responses were randomly assigned to and scored by two of four trained raters; the average of the two raters' scores served as a test-taker's score.
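
To make the aggregation concrete, the following is a minimal illustrative sketch (not the authors' code) of the scoring procedure just described; the rater labels, the data layout, and the exact assignment of the 23 dimensions to the four rubrics (4 + 9 + 4 + 6) are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code) of the scoring aggregation described
# above: two of four trained raters are randomly assigned to each response, each
# rater scores 23 dimensions on a 1-6 scale, and the test-taker's item scores are
# the averages of the two assigned raters' ratings.
import random
import statistics

N_DIMENSIONS = 23
# Hypothetical clustering of the 23 dimensions into the four rubrics (4 + 9 + 4 + 6).
RUBRICS = {
    "rubric1_recognizing_relevance": range(0, 4),
    "rubric2_evaluating_deciding": range(4, 13),
    "rubric3_recognizing_consequences": range(13, 17),
    "rubric4_writing_effectiveness": range(17, 23),
}


def score_response(ratings_by_rater):
    """Average two randomly assigned raters' 23-dimension ratings into rubric scores."""
    rater_a, rater_b = random.sample(list(ratings_by_rater), 2)
    item_scores = [
        (ratings_by_rater[rater_a][i] + ratings_by_rater[rater_b][i]) / 2
        for i in range(N_DIMENSIONS)
    ]
    scores = {
        name: statistics.mean(item_scores[i] for i in idx)
        for name, idx in RUBRICS.items()
    }
    scores["total"] = sum(item_scores)                     # out of 138 points
    scores["overall_mean"] = statistics.mean(item_scores)  # on the 6-point scale
    return scores


# Example with four hypothetical raters, each giving a constant rating:
ratings = {rater: [value] * N_DIMENSIONS for rater, value in zip("ABCD", (4, 5, 3, 4))}
print(score_response(ratings))
```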

Cognitive Interview
Following the completion of WT, an 80-minute semi-structured cognitive interview was conducted and taped for later transcription to: (a) examine how participants handled the PAL task; (b) examine students' self-reported thought, decision-making, and response processes when performing the task; and (c) assess how the provided information was considered, particularly in
interviews, we also gathered data on how the quantity and quality of each of
the different documents were perceived while completing the task and exam-
ined the participants’ ability to judge information source quality, in terms of
its validity, and relevance. For every single given piece of information (see
Table A1 in the appendix) the participant was asked whether they rated this
information as (ir)relevant or (in)valid.
The second part of the interview focused on individual factors that influ-
ence the decision-making and response process, for example, personal beliefs,
relevant background knowledge on the story topic, and the extent to which test-takers' knowledge or experience contradicted the sources provided; for example, whether a general rejection of wind turbine building for ecological
reasons resulted in the selective inclusion of the given information. Particular
attention was paid to seeing whether students used their prior knowledge of
a particular domain (e.g., economics) to solve the tasks. Finally, participants
were given the opportunity to give personal feedback on the task, reflecting
how personal skills might have influenced their processing of WT.

Questionnaire
After the cognitive interview, an additional short standardized questionnaire of approximately 10 minutes was administered. The questionnaire focused on participants' (1) ability to judge the quality of the information/sources, (2) use of different decision-making heuristics, (3) general media use (in a detailed query), and (4) when and how they reached a decision. The instrument was also used to describe decision-making.

Evidence Bearing on Technical Quality


Here we report evidence bearing on reliability and validity. Reliability is
examined with both classical test theory and generalizability theory. Validity
evidence comes from cognitive interviews and the relation between WT total
scores and data gathered from interviews and questionnaire scales (including
biographical data and an intelligence test).

Score Distribution and Reliability. Participants on average scored 82 out of a maximum of 138 points (23 rubrics × 6-point scale), with a minimum of 32 points and a maximum of 107 points. The scores were negatively skewed (skewness = −1.05) and peaked (kurtosis = 1.22). Total score internal consistency reliability was 0.95. Scores on the 6-point scale are shown in Table 1 as averages, along with their corresponding coefficient alpha reliabilities.
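
For reference, the internal consistency coefficient reported here is the standard Cronbach's alpha computed over the k = 23 scored dimensions; the textbook formula (not a study-specific derivation) is:

```latex
\alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\hat\sigma^{2}_{Y_i}}{\hat\sigma^{2}_{X}}\right),
\qquad k = 23, \qquad X = \sum_{i=1}^{k} Y_i ,
```

where \hat\sigma^{2}_{Y_i} is the sample variance of scores on dimension i and \hat\sigma^{2}_{X} is the variance of the total scores.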
Generalizability theory (Shavelson & Webb, 1981) was used to examine score "reliability" in greater depth than is possible with internal consistency alpha. We considered raters to be exchangeable (Shavelson & Webb, 1991), so we treated rater pairs as if they were the same even though the particular rater pairs differed over students. For this purpose, four study assistants completed a rater training, and the test-takers' answers were then distributed among these four raters so that two randomly selected raters scored each written response. The student × item × rater G-study data were analyzed with GENOVA 3.1. We found that, with 23 items and 2 raters, the generalizability ("reliability") coefficient for total scores was 0.74 (with 1 rater, 0.59; with 4 raters, 0.84). The dependability coefficient ("criterion-referenced" reliability) was 0.55, 0.69, and 0.79 with 1, 2, and 4 raters, respectively. The strong student interactions with item and rater suggested that the measurement procedure could be improved by enhanced rater training and/or an increased number of items.
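
For readers unfamiliar with these decision-study projections, the coefficients follow the standard formulas for a fully crossed person × item × rater (p × i × r) random-effects design (Shavelson & Webb, 1991); the reported values result from inserting the estimated variance components with n'_i = 23 items and n'_r = 1, 2, or 4 raters:

```latex
E\hat\rho^{2} \;=\;
\frac{\hat\sigma^{2}_{p}}
     {\hat\sigma^{2}_{p} + \hat\sigma^{2}_{pi}/n'_i + \hat\sigma^{2}_{pr}/n'_r + \hat\sigma^{2}_{pir,e}/(n'_i\,n'_r)},
\qquad
\hat\Phi \;=\;
\frac{\hat\sigma^{2}_{p}}
     {\hat\sigma^{2}_{p} + \dfrac{\hat\sigma^{2}_{i}+\hat\sigma^{2}_{pi}}{n'_i}
      + \dfrac{\hat\sigma^{2}_{r}+\hat\sigma^{2}_{pr}}{n'_r}
      + \dfrac{\hat\sigma^{2}_{ir}+\hat\sigma^{2}_{pir,e}}{n'_i\,n'_r}} ,
```

where E\hat\rho^{2} is the generalizability (relative-error) coefficient and \hat\Phi the dependability (absolute-error) coefficient.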

Validity
We had a variety of evidence against which to examine interpretations of the
performance assessment scores including cognitive interviews and correlation
analyses of survey items.
Cognitive Interview: The 30 respondents were asked when they had come to
their decision on whether to support wind turbines. Fourteen of 30 respondents
came to their decisions after reviewing the documents in the library. Eleven
considered “for” and “against” arguments on the basis of the facts and information;
two decided as they wrote their responses; and one decided after finishing writing.
The qualitative analysis of the cognitive interviews indicates that these respondents weighed the presented information and documents as arguments for and against, which we would regard as highly critical thinking. These 14 participants performed better (on average over 3.7 points on the 6-point scale) than students who, according to the cognitive interviews, thought less critically (on average < 3.5 points). One respondent reported having based the decision on prior knowledge and performed considerably worse than the 14 participants (2.46 points). Seven respondents reported having decided directly after reading the task for the first time (with or without the accompanying documents; average score of 3.42) (see Table A3 in the appendix).
The rank-order correlations between the time when the decision was made and performance scores were 0.25, 0.06, 0.26, 0.30, and 0.22 for the total score and the four rubric scores, respectively. These correlations reflect the differences among the groups of test-takers' mean scores just described.
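
As an aside, these rank-order coefficients are Spearman correlations between the ordinal decision-time categories (ordered as in Table A3) and the WT scores; a minimal sketch of that computation, using hypothetical data rather than the study data, is:

```python
# Hypothetical illustration of the rank-order (Spearman) correlation between the
# ordinal "time of decision" category (coded 1-9 as ordered in Table A3) and WT
# performance scores; the numbers below are invented for demonstration only.
from scipy.stats import spearmanr

decision_time = [1, 3, 3, 7, 7, 7, 9, 5, 7, 2]                    # ordinal codes
total_score = [3.1, 3.4, 3.5, 3.9, 3.6, 3.8, 3.9, 2.5, 3.7, 3.3]  # mean WT scores

rho, p_value = spearmanr(decision_time, total_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```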
Finally, we found that test takers (with one exception, as noted) did not
make use of their prior knowledge when solving the tasks. Furthermore, there
were no statistically significant correlations between prior knowledge, study
domain and WT performance.
Questionnaire Analyses: We expected measures of achievement (high
school grade-point-average [GPA], mathematics grade, and German grade) to be
positively correlated with WT total scores. We found correlations of: 0.20, 0.13,
and 0.15, respectively. We expected the intelligence scores to positively correlate
with total WT scores; the actual result was 0.32. These findings support the interpretation that WT scores are influenced, as expected, by indicators of cognitive ability, but the correlations are low, suggesting, as assumed, that general cognitive factors are only one of several influences on critical thinking.
In addition, we examined the performance assessment scores as to: gender,
degree (bachelor, master), migration background, type of high school attended
(academic/vocational), vocational training, study abroad, and socio-political
engagement (see Table A4 in the appendix). With a few exceptions, none of these
variables significantly influenced performance assessment scores. However, we
did find that socio-political commitment significantly correlated with performance
(0.41), which suggests that skills acquired during experiential activities, such as
being an active member of a political party, influence critical thinking as well.

DISCUSSION AND IMPLICATIONS FOR FURTHER iPAL DEVELOPMENT

By critically examining the data and the first findings in the pilot study, we saw
what worked well with the newly developed performance assessment, WT, and
what did not work so well. While we could produce reliable scores on WT, and
total scores correlated positively with measures of general cognitive ability, the
think aloud data from cognitive interviews suggested that about half of test-takers
did not actually carry out the performance assessment as intended. Rather they
took the story at face value and wrote a recommendation paying scant attention to
the documents. While time pressure did not appear to be an issue with the test
takers, this may be because they did not review the documents carefully—think-
ing too fast rather than slow. One implication of this finding is that fewer docu-
ments and sharper focus in the storyline are needed in revisions.
More generally, following ECD (Mislevy & Haertel, 2007), the domain analysis and modeling should be critically reexamined. The test design model is
based on the criterion sampling approach (Shavelson et al., 1974). While analyses
and descriptions of the domain in question, for example, based on job descriptions
(e.g., patrolman), can be implemented relatively well for determining job-specific
skills, a precise description is much more complicated for generic skills such as
critical thinking. The challenges became particularly evident in the construct def-
inition and test framework for WT. They also became apparent as we developed the scoring scheme, for example, in precisely and distinctly dividing and weighting the individual (sub)dimensions and categories.
The reliability analyses presented here support the theoretically developed
scoring scheme with the four main categories as facets of the overall construct
critical thinking. However, latent factor analyses and cluster models as well as
multiple-category IRT models, with a larger sample, are recommended for fur-
ther studies in order to gain differentiated information about weighting of sin-
gle tasks and the test’s internal structure.
In sharpening the construct it might be helpful to consider whether
approaches developed in the theory of moral judgement are transferable to the
idea of critical thinking in general. In both cases, it appears necessary to balance conflicting interests (in a wider sense, aspects or arguments). Roughly speaking, the (moral) quality of solutions was measured on two dimensions: (1) which principle/value (e.g., [economic] advantage, [social] acceptance, [environmental] sustainability) a position statement is based on, and (2) how narrow or broad the participant's perspective on the consequences of a decision is (ranging from ego and alter up to the actual and future world population). There are sophisticated models of moral reasoning (e.g., the neo-Kohlbergian moral judgement theory developed by Minnameier, 2001) that should be examined to determine whether and to what extent they may add clarity in sharpening the construct of critical thinking.
The interrater reliability and generalizability analyses indicate that the sources of measurement error in the assessment design are far more complicated than is reflected in coefficient alpha. Sorting out this complexity in, for instance, a
design with items nested in subareas would provide a more accurate picture of
measurement error. Moreover, regressing WT total and subscores on nature of
information (trustworthiness, relevance, judgment/bias) used by students would
bear concretely on inferences about their critical thinking. Moreover, WT test
instructions should be critically examined and possibly stated more precisely.
The role of previous knowledge and beliefs has to be accounted for by
varying the domain knowledge underlying the tasks sampled in the assessment.
Remarkably, it was the test-takers who integrated their beliefs (not their previous knowledge) into solving the case who achieved the highest performance test scores. This finding raises further questions regarding the relationship between knowledge, beliefs, and critical thinking processes, which requires further examination.
All in all, the pilot test provided a solid basis for continuing to work on a more precise determination of the threshold and quality criteria for critical thinking and on appropriate indicators for the representation of the underlying mental processes. In other words, the crucial question arises as to what extent critical thinking is being assessed when the student is working on the performance task, compared to, for example, simple integrative information processing, testwiseness, and/or general abilities irrelevant to the construct. Further, the analysis would examine whether it is "only" test motivation that leads to a particularly intensive, differentiated, and detailed solving of the task or whether test-takers have specific cross-situational critical attitudes that lead them to think critically in terms of the construct definition.
In future analyses, the focus of the cognitive interviews will be on the cognitive processes involved in real-time task solving, to determine to what extent critical thinking took place—the extent to which test-takers' cognitive processes reflect the critical thinking the tasks are intended to stimulate. Moreover, cognitive analyses should examine whether respondents applied stereotype-based fast thinking on the basis of simple heuristics when solving the performance task. Within the scope of an additional experimental laboratory study, we need to examine how much time participants need to complete the task and thereby demonstrate their critical thinking skills. Here, eye-tracking as well as analyses of keystroke logfiles could provide additional relevant information.

ACKNOWLEDGEMENTS
We would like to thank Marie-Theres Nagel, Dimitri Molerov and our student
assistants for their work in carrying out the study. We would like to thank the
two anonymous reviewers and the editors, Maria Elena Oliveri and Robert
Mislevy, who provided constructive feedback and helpful guidance in the
revision of this paper.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (AERA, APA, & NCME). (2014). Standards for educa-
tional and psychological testing. Washington, DC: American Educational Research
Association.
Ercikan, K., & Pellegrino, J. W. (2017). Validation of score meaning for the next generation of
assessments: The use of response processes. New York, NY: Routledge.
Feinstein, N. R. (2014). Making sense of autism: Progressive engagement with science among
parents of young, recently diagnosed autistic children. Public Understanding of Science, 23(5),
592–609. doi:10.1177/0963662512455296
Fu, A. C., Raizen, S. A., & Shavelson, R. J. (2009). Education. The nation's report card: A vision
of large-scale science assessment. Science, 326(5960), 1637–1638. doi:10.1126/
science.1177780
Kahneman, D. (2011). Thinking, fast and slow. New York, NY: Farrar, Straus and Giroux.
Kim, H., & Lalancette, D. (2013). Literature review on the value-added measurement in higher
education. Paris, France: OECD.
Kosslyn, S. B., & Nelson, B. (2017). Building the intentional university. Minerva and the future
of higher education. Cambridge, MA: MIT Press.
Kuhn, C., Zlatkin-Troitschanskaia, O., Brückner, S., & Saas, H. (2018). A new video-based tool
to enhance teaching economics. International Review of Economics Education, 27, 24–33.
doi:10.1016/j.iree.2018.01.007
Lane, S., & Stone, C. A. (2006). Performance assessments. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 387–432). Westport, CT: Praeger.
McClelland, D. C. (1973). Testing for competence rather than for “intelligence”. The American
Psychologist, 28(1), 1–14.
Minnameier, G. (2001). A new stairway to moral heaven - A systematic reconstruction of stages
of moral thinking based on a Piagetian “logic” of cognitive development. Journal of Moral
Education, 30(4), 317–337. doi:10.1080/03057240120094823
Mislevy, R. J., & Haertel, G. D. (2007). Implications of evidence-centered design for educational
testing. Educational Measurement: Issues and Practice, 25(4), 6–20. doi:10.1111/j.1745-3992.
2006.00075.x
National Assessment Governing Board U.S. Department of Education (NAEP). (2009). Science
Framework for the 2009 National Assessment of Educational Progress. Retrieved from https://
www.nagb.gov/content/nagb/assets/documents/publications/frameworks/science/2009-science-
framework.pdf
OECD. (2013a). Assessment of Higher Education Learning Outcomes. Data Analysis and
National Experiences (Feasibility Study Report Vol. 2). Retrieved from http://www.oecd.org/
education/skills-beyond-school/AHELOFSReportVolume2.pdf
OECD. (2013b). Assessment of Higher Education Learning Outcomes. Further Insights
(Feasibility Study Report Vol. 3). Retrieved from http://www.oecd.org/education/skills-beyond-
school/AHELOFSReportVolume3.pdf
Oliveri, M. E., Lawless, R., & Young, J. W. (2015). A validity framework for the use and develop-
ment of exported assessments. Princeton, NJ: ETS. Retrieved from https://www.ets.org/s/about/
pdf/exported_assessments.pdf
Schaap, L., Schmidt, H., & Verkoeijen, P. J. L. (2011). Assessing knowledge growth in a psych-
ology curriculum: Which students improve most? Assessment & Evaluation in Higher
Education, 37, 1–13.

Secolsky, C., & Denison, B. D. (Eds.). (2018). Handbook on measurement, assessment and evalu-
ation in higher education (2nd ed.). Oxford, UK: Routledge.
Seminara, J. L., Shavelson, R. J., & Parsons, S. O. (1967). Effect of reduced pressure on human
performance. Human Factors, 9(5), 409–418. doi:10.1177/001872086700900503
Shavelson, R. J. (2010). Measuring college learning responsibly: Accountability in a new era.
Stanford, CA: Stanford University Press.
Shavelson, R. J. (2013). On an approach to testing and modeling competence. Educational
Psychologist, 48(2), 73–86. doi:10.1080/00461520.2013.779483
Shavelson, R. J., Baxter, G. P., & Pine, J. (1991). Performance assessment in science. Applied
Measurement in Education, 4(4), 347–362.
Shavelson, R. J., Beckum, L., & Brown, B. (1974). A criterion-sampling approach to selecting
patrolmen. Police Chief, 41(9), 55–61.
Shavelson, R. J., Davey, T., Holland, P. W., Webb, N. M., & Wise, L. L. (2015). Psychometric
considerations for the next generation of performance assessment. Princeton, NJ: Educational
Testing Service.
Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of
Mathematical and Statistical Psychology, 34(2), 133–166. doi:10.1111/j.2044-
8317.1981.tb00625.x
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:
Sage.
Shavelson, R. J., Zlatkin-Troitschanskaia, O., & Mariño, J. P. (2018). International performance
assessment of learning in higher education (iPAL) – Research and development. Wiesbaden,
Germany: Springer.
Stanovich, K. E. (2009). What intelligence tests miss: The psychology of rational thought (1st ed.).
New Haven, CT: Yale University Press.
Stanovich, K. E. (2016). The rationality quotient: Toward a test of rational thinking (1st ed.).
Cambridge, MA: MIT Press.
Tremblay, K., Lalancette, D., & Roseveare, D. (2012). Assessment of Higher Education Learning
Outcomes. Design and Implementation (Feasibility Study Report Vol. 1). Retrieved from http://
www.oecd.org/education/skills-beyond-school/AHELOFSReportVolume1.pdf
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science,
185(4157), 1124–1131. doi:10.1126/science.185.4157.1124
Wheeler, P., & Haertel, G. D. (1993). Resource handbook on performance assessment and meas-
urement: A tool for students, practitioners, and policymakers. Berkeley, CA: Owl Press.
Zlatkin-Troitschanskaia, O., Shavelson, R. J., & Pant, H. A. (2018). Assessment of learning out-
comes in higher education: International comparisons and perspectives. In C. Secolsky & D. B.
Denison (Eds.), Handbook on measurement, assessment and evaluation in higher education
(2nd ed., pp. 686–698). Oxford, UK: Routledge.

TABLE A1
Overview of the PAL Task "Wind Turbine," Including Scoring

No. | Section | Content | Presentation | Source | Scoring
[1] | Scenario | Description of the situation: wind turbine construction | Text | in-house construction | —
[2] | Task to be performed | Task description | Text | in-house construction | —
[3] | Task to be performed | Additional description: further notes with regard to the argumentative statement | Text | in-house construction | —
[4] | Task to be performed | Additional description: further notes with regard to the given information and sources | Text | in-house construction | —
[5] | Task to be performed | Additional description: further notes with regard to the time and scope of the statement | Text | in-house construction | —

List of appendices
[6] | Notes on the members of the local council | Name, profession, background information on each of the seven members of the local council | Table | in-house construction | relevant +/valid +
[7] | Municipality Elmsen | Demographic structure: population | Number | in-house construction | relevant +/valid +
[8] | Municipality Elmsen | Demographic structure: age structure | Pie chart | in-house construction | relevant +/valid +
[9] | Municipality Elmsen | Demographic structure: employment | Bar chart | in-house construction | relevant +/valid +
[10] | Municipality Elmsen | Demographic structure: commuters | Number | in-house construction | relevant +/valid +
[11] | Municipality Elmsen | Demographic structure: last general election | Bar chart | in-house construction | relevant +/valid +
[12] | Municipality Elmsen | Location | Text | in-house construction | relevant +/valid +
[13] | Municipality Elmsen | History | Text | in-house construction | relevant +/valid +
[14] | Municipality Elmsen | Politics/Law | Text | in-house construction | relevant +/valid +
[15] | Municipality Elmsen | Economy | Text | in-house construction | relevant +/valid +
[16] | Municipality Elmsen | Land utilization in the three municipalities | Figure | in-house construction | relevant +/valid +
[17] | Structure of a wind turbine | Description of each part of a wind turbine | Text as list and figure | Energienpoint.de | relevant −/valid +
[18] | Effects of wind turbines: Emission/Immission | Brief section on bird mortality | Text | Wikipedia "Windkraftanlage" | relevant +/valid −
[19] | Effects of wind turbines: Emission/Immission | Extensive text on wind turbines | Text with figures and graphs | Hyperlink to the Wikipedia website "Windkraftanlage" | —
[20] | Effects of wind turbines: Emission/Immission | Brief section on bat mortality | Text | Wikipedia "Windkraftanlage" | relevant +/valid −
[21] | Effects of wind turbines: Emission/Immission | Bats and wind turbines | Text with figures | NaBu | relevant +/valid −
[22] | Effects of wind turbines: Emission/Immission | Brief section on sound emissions | Text | Sound (excerpt from Obermaier, H., 2011) | relevant +/valid +
[23] | Effects of wind turbines: Emission/Immission | Sound immissions of wind turbines | Text with figures and graphs | Fachagentur Windenergie | relevant −/valid +
[24] | Effects of wind turbines: Emission/Immission | Sound immissions of wind turbines in the surrounding area | Figure | DEWI GmbH | relevant −/valid +
[25] | Effects of wind turbines: Emission/Immission | Infrasound (detrimental to one's health?) | Headline to the subsequent source | Newspaper Die Welt | relevant +/valid −
[26] | Effects of wind turbines: Emission/Immission | Reference newspaper Die Welt | Hyperlink to website | Newspaper Die Welt | relevant +/valid −
[27] | Effects of wind turbines: Emission/Immission | Bird strike | Headline to the following video | ARD Mediathek | relevant +/valid +
[28] | Effects of wind turbines: Emission/Immission | Reference video (6 min.) | Hyperlink to Mediathek | ARD Mediathek | relevant +/valid +
[29] | Effects of wind turbines: Emission/Immission | Video (3 min.) | Headline to the following video | ARD Mediathek | relevant +/valid −
[30] | Effects of wind turbines: Emission/Immission | Reference video (3 min.) | Hyperlink to Mediathek | ARD Mediathek | relevant +/valid −

TABLE A2
Sample Description (N = 30)

Variable                                        n        %
Gender
Women 21 70.0
Men 9 30.0
Degree
Bachelor 25 83.3
Master 5 16.7
Country of birth
Germany 27 90.0
Other 1 3.3
Parents migration background
Yes 3 10.0
No 27 90.0
Highest school-leaving qualification in Germany
Yes 29 96.7
No — —
Completed vocational training
Yes 11 36.7
No 18 60.0
Completed internship
Yes 25 83.3
Not specified 5 16.7
Stay abroad
Yes 14 46.7
No 15 50.0
Social engagement
Yes 13 43.3
No 16 53.3

Variable                                        n        M        SD
Age 29 23.93 4.03
Semester 30 4.77 2.46
University entry qualification grade 28 2.07 0.60

TABLE A3
Time of Decision (Frequency; Average Test Score)

Decision was made…
immediately after reading the scenario description and the task for the first time (before reading the accompanying documents): n = 3; M = 3.40
after reading the scenario description and task several times: n = 2; M = 3.50
immediately after reading the entire task, including the accompanying documents, for the first time: n = 7; M = 3.42
after reading the entire task, including the accompanying documents, several times: n = 1; M = 1.98
after considering "for" and "against" arguments that were based on prior knowledge: n = 1; M = 2.46
after considering "for" and "against" arguments that were based on what you believe to be true: n = 2; M = 4.32
after considering "for" and "against" arguments based on the presented facts and information: n = 11; M = 3.72
while writing the statement: n = 2; M = 3.78
after completion of the statement: n = 1; M = 3.85

TABLE A4
Means of PT-Performance of Different Groups

Variable M SD
Gender
Women 3.54 0.82
Men 3.57 0.63
Degree
Bachelor 3.53 0.73
Master 3.67 0.95
Migration Background
No 3.63 0.75
Yes 2.83 0.32
Completed Vocational Training
No 3.52 0.87
Yes, commercial 3.72 0.52
Yes, non-commercial 3.28 —
Stay Abroad
Yes 3.72 0.70
No 3.45 0.78
Social Engagement
No 3.26 0.81
Yes, political 4.25 0.20
Yes, faith-related 3.85 0.37
Yes, athletic 3.74 —
Yes, other 3.90 0.58
TABLE A5
Dimension 1 of the Rating Scheme with Individual Categories and Behavior Anchors

Recognizing and evaluating the relevance, reliability, and validity of given information

Amount of used/considered information and sources
1 – Did not recognizably use given information
2 – Used only little of the given information (1 source)
3 – Used some of the given information (2 sources)
4 – Used 3 sources of the given information
5 – Considered 4 or more sources of the given information
6 – Considered 4 or more sources of the given information and included further information

Accurately judges quality of evidence, avoiding or qualifying unreliable, erroneous, and uncertain sources and information
1 – Only used unreliable information
2 – Used some of the reliable given information (1 source) and mostly unreliable information
3 – Used 2 sources of the relevant reliable information and some unreliable information
4 – Used 3 sources of the reliable information
5 – Used 4 or more sources of the given reliable information
6 – Used 4 or more sources of the reliable information and defined unreliable information

Accurately judges quality of evidence, avoiding invalid and irrelevant information
1 – Only used irrelevant information
2 – Used 1 of the relevant given information and mostly irrelevant information
3 – Used 2 of the relevant given information and some irrelevant information
4 – Used most of the relevant given information
5 – Used the relevant information correctly
6 – Used only relevant information and defined irrelevant information

Acknowledges uncertainty and justified and reasonable need for further information
1 – Does not acknowledge uncertainty / need for further information
2 – Acknowledges some uncertainty, no mention of need for information
3 – Mentions and specifies uncertainty, no mention of need for information
4 – Mentions and specifies uncertainty, mentions need for more information
5 – Mentions and specifies uncertainty, precisely describes need for more information
6 – Mentions and specifies uncertainty, defines and justifies needed information
