DOI: 10.1080/10986060701341332

MATHEMATICAL THINKING AND LEARNING, 9(2), 83–130
Copyright © 2007, Lawrence Erlbaum Associates, Inc.
Students’ Appreciation of Expectation and Variation as a Foundation for Statistical Understanding
Jane M. Watson
Faculty of Education
University of Tasmania, Australia
Rosemary A. Callingham
School of Education
University of New England, Australia
Ben A. Kelly
Faculty of Education
University of Tasmania, Australia
This study presents the results of a partial credit Rasch analysis of in-depth interview
data exploring statistical understanding of 73 school students in 6 contextual settings.
The use of Rasch analysis allowed the exploration of a single underlying variable
across contexts, which included probability sampling, representation of temperature
change, beginning inference, independent events, the relationship of sample and pop-
ulation, and description of variation. Interpretation of the demands of increasing code
levels for the resulting variable revealed an increasing appreciation of and interaction
between the ideas of variation and expectation. Student progression in understanding
is illustrated with kidmaps, and educational implications are considered.
Correspondence should be sent to Jane M. Watson, Faculty of Education, University of Tasmania,
Private Bag 66, Hobart, Tasmania 7001, Australia. E-mail: Jane.Watson@utas.edu.au

The importance of variation as the foundation for statistical understanding at the
school level has been increasingly recognized since Moore’s (1990) description of
“uncertainty” in relation to chance and data. Although at that time contemporary
curriculum documents did not specifically highlight variation (Australian
Education Council [AEC], 1991; National Council of Teachers of Mathematics
[NCTM], 1989), calls for research into understanding of this important facet of sta-
tistics were made by Green (1993) and later Shaughnessy (1997). This recent inter-
est, however, is juxtaposed with a traditional emphasis on expectation seen in topics
in the school curriculum. The mean, or expected value, has been prominent in the
mathematics curriculum for well over 100 years (e.g., Capel, 1885) and theoretical
probability for at least half a century (e.g., Hart, 1953). Shaughnessy (1997) sug-
gested that the reason for the emphasis on expectation may have been that the math-
ematical calculations to find the mean or a simple probability were accessible to
quite young students, whereas the standard deviation, measuring variation, was
more complex to calculate and left to later years. The growing interest in apprecia-
tion of variation across the years of schooling and the long established interest in
expectation, particularly in relation to probability (e.g., Fischbein, 1975; Green,
1983, 1991; Jones, 1974), suggest the need to consider the development of under-
standing of the interaction between the two: expectation, often expressed in proba-
bilities of outcomes or means of data sets, and variation, which for several reasons
is likely to occur about the expected pattern.
The terms expectation and variation, like many others in mathematics, have
both technical and colloquial meanings. James and James (1959, p. 151) defined
expectation in terms of the expected value of a continuous function, $E(x) = \bar{x} = \int_{-\infty}^{\infty} x f(x)\,dx$,
and variation as $\sigma^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2 f(x)\,dx$. Such sophisticated definitions
may suggest the reason why the terms are not common in school curriculum docu-
ments. For school students, however, expectation with respect to chance and data
is likely to be experienced in terms of probabilities, averages, “caused” differ-
ences, and random distributions of outcomes, whereas variation is likely to be
experienced in relation to uncertainty, anticipated change, unanticipated change,
and outliers. In our study the terms expectation and variation are used to express
these school-curriculum-based ideas, rather than the sophisticated definitions used
by statisticians. The focus of the research presented in this article is the growing
understanding of these ideas during the school years, based on Rasch analysis of
interviews with 73 students in Australian schools. This is the first study in mathe-
matics education to analyze in-depth interview data in this fashion and it illustrates
the potential of Rasch analysis to contribute beyond the arena of survey and
achievement data. The additional information obtained from probing questions in
interviews, together with use of the Rasch Partial Credit Model (PCM) (Masters,
1982), allows a developmental model to be proposed that brings together expecta-
tion and variation in terms of school students’ development of understanding.
LITERATURE REVIEW
The earliest description of children’s development of notions of chance was
given by Piaget and Inhelder in 1951, although this was not translated and widely
available in English until 1975 (Piaget & Inhelder, 1975). In this work, children’s
understanding of random events and centered distributions, and the conflict seen
when children attempted to reconcile the behavior of a biased spinner with
respect to its apparently random nature were described. Three stages of develop-
ment were identified. In the first, children were unable to distinguish between “…
the possible and the necessary” (p. 216); by the second stage, children recognized
the nonpredictability of chance events that were not amenable to deductive rea-
soning; in the third stage, these ideas were reconciled and children could reason
deductively to identify outcomes and quantify these, but also accept the innate
unpredictability of chance events.
The history of more recent research into understanding of stochastic ideas has
followed the curriculum to some extent, with expectation reflected in research into
chance and probability, especially in the work of Green (e.g., 1983, 1986, 1991) and
Fischbein (e.g., Fischbein & Gazit, 1984; Fischbein, Nello, & Marino, 1991;
Fischbein & Schnarch, 1997) picking up the misconceptions identified by
Kahneman and Tversky (e.g., 1972). These classic studies focused almost entirely
on the expectations related to events in sample spaces. Although Kahneman and
Tversky implicitly recognized the importance of variation in some of their scenar-
ios—for example, the famous “hospital” problem—the explicit reference made was
to the representativeness of the sample in terms of sample size. Rubin, Bruce, and
Tenney (1991) were the first to comment specifically on students’ struggles with
expectation and variation in situations of samples of different sizes. Interest in the
mean by Mokros and Russell (1995) continued the research related to expectation
and they concluded that few students had an appreciation of the representative
nature of the mean in terms of the data set it represented.
Konold and Pollatsek (2002) used the metaphor of “noise” for variation and
“signal” for expectation. In thinking of a signal as a central tendency they meant
a stable value that (a) represents the signal in a variable process and (b) is better
approximated as the number of observations grows. … Processes with central ten-
dencies have two components: (a) a stable component, which is summarized by
the mean, for example, and (b) a variable component, such as the deviation of
individual scores around an average, which is often summarized by the standard
deviation. (p. 262)
Mainly considering examples related to measurement, either in repeated measures
or in measuring individuals, Konold and Pollatsek also considered discrete
events leading to probabilities that can be considered as signals based on rates.
Although acknowledging variation, their main focus was the importance of find-
ing a signal (expectation) among the noise (variation).
In parallel with these developments and following the foundation set in place
by Moore (1990), Wild and Pfannkuch (1999) placed variation as a central com-
ponent of statistical thinking in their four-dimensional model of statistical think-
ing in the domain of professional statisticians. Reading and Shaughnessy (2004)
summarized these research developments, and interest in variation led to a forum
devoted entirely to understanding of statistical variation by students of all levels
including introductory university courses (Lee, 2003). Petrosino, Lehrer, and
Schauble (2003) considered Grade 4 students’ appreciation of variation through
data collection experiences and the creation and description of error distributions.
Based on surveys that were designed to assess students’ understanding of varia-
tion within contexts, where aspects of the mathematics curriculum addressing
chance and data were the source of content, Watson, Kelly, Callingham, and
Shaughnessy (2003) used data from 746 students in Grades 3, 5, 7, and 9 to sug-
gest a developmental model of understanding. This model, based on Rasch Partial
Credit Modeling (Masters, 1982; Rasch, 1980), suggested four hierarchical levels
of understanding. At Level 1, “Prerequisites for Variation,” students tended to be
unable to deal with the environment and were likely to use idiosyncratic stories
and personal experiences to justify responses. At Level 2, “Partial Recognition of
Variation,” students began putting ideas into context but focused on single ideas
neglecting other important aspects. Explanations for observed change in data
were more likely to focus on artificial patterns, “anything can happen,” or irrele-
vant features. Responses at Level 3, “Applications of Variation,” tended to consol-
idate ideas in context, although showing inconsistency in picking the most salient
features for consideration, for example, on questions related to variation and sam-
pling. Level 4 responses reflected “Critical Aspects of Variation,” employing
complex justifications or critical reasoning.
Some of the specific antecedents to this study are found in the frustration of
Zawojewski and Shaughnessy (2000) with a National Assessment of Educational
Progress (NAEP) survey item about expectation set in a probabilistic setting but
which did not encourage acknowledgement of possible variation in outcomes.
This led Shaughnessy, Watson, Moritz, and Reading (1999) to experiment with
various forms of the survey item that would allow for recognition of potential
variation in the context of sampling from a container of candy, with expectation
based on fixed proportions of three colors. Student responses, in particular those
related to speculation on repeated experiments, could be coded for closeness to
the expected proportion and for the reasonableness of the variation shown.
Further development of the item for use in interview settings by Torok and
Watson (2000), Reading and Shaughnessy (2000, 2004), and Kelly and Watson
(2002) led to the suggestion of developmental paths based on detailed analysis of
student reasoning. The relationship of expectation and variation was also
explored with surveys by Watson and Kelly (2003b, 2004b) for imagined rolls of
a six-sided die 60 times and for outcomes of spinning a single 50–50 spinner
many times, and by Shaughnessy, Canada, and Ciancetta (2003) for these tasks
and the candy task. In all of these studies there was growing appreciation for the
interaction between the expectation of the theoretical model and the innate appre-
ciation that variation will occur from it.
THIS STUDY
Following the analysis of survey and interview data from tasks that suggested
students had difficulty reconciling expectation and variation, it was of interest to
combine interview data from a number of tasks based in different contexts using
Rasch measurement techniques, to provide a more complete model based on in-
depth reasoning in contexts involving experimentation and extended tasks. The
aims of this study were then to:
1. identify a hierarchy for the conceptual understanding of expectation and
variation and their interaction, using outcomes of in-depth interviews
2. provide a rich description of the development through a consideration of
students’ responses to the in-depth interview tasks.
METHODOLOGY
Sample
The sample for this study consisted of 66 students selected from the sample of
746 surveyed by Watson et al. (2003), plus an extra 7 six-year-old children who
were near the end of a preparatory year of full-time schooling before entering
Grade 1 (Prep). The 66 students included 18 from Grade 3, 18 from Grade 5, 15
from Grade 7 and 15 from Grade 9, selected by the researchers for a range of
responses to the survey questions and by their teachers as being articulate and
willing to be interviewed by the researchers. The six-year-olds were chosen by
their teacher as articulate and happy to talk to the researchers; these students had
been involved in an enriched mathematics program but not one involving chance
and data. The other students had experienced a mathematics curriculum based on
state guidelines derived from A National Statement on Mathematics for
Australian Schools (AEC, 1991), although it was unknown what specific topics
they had studied in relation to chance and data.
Tasks
The tasks used in this study were based on five protocols in different contexts
involving various aspects of chance and data where variation and expectation play a
role in decision making, together with an explanation of words associated with vari-
ation. The protocols are presented in Appendix A in the order in which they were
answered by most students. The Lollies Task was developed from the work of
Shaughnessy et al. (1999) and Torok and Watson (2000), and initial analysis of out-
comes for students in this study is given in Kelly and Watson (2002). The task
involved 100 lollies (or candies) in a container, of which 50 were red, 20 yellow,
and 30 green. Students were presented with the container and asked to speculate in
various ways about the number of reds in a handful of 10 removed without looking.
They were then allowed to produce six handfuls of 10 (with replacement), asked if
they wished to change their estimates, and given the opportunity to represent the
outcomes of 40 such trials. Ideas related to expectation and variation occurred
throughout all parts of the protocol reflecting the initial concerns of Zawojewski
and Shaughnessy (2000).
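As an illustrative aside that is not part of the original protocol, the theoretical expectation and variation for the number of reds in one handful can be worked out if a handful is treated as a simple random draw of 10 lollies from the 100 without replacement, that is, under a hypergeometric model. That model and all names in the sketch below are assumptions made for illustration; the article does not specify a formal model for the handfuls.

```python
from fractions import Fraction
from math import comb

# Container composition taken from the task description.
TOTAL_LOLLIES = 100
RED_LOLLIES = 50
HANDFUL = 10


def red_count_distribution(total=TOTAL_LOLLIES, red=RED_LOLLIES, draw=HANDFUL):
    """Return {k: P(k reds in one handful)} under the assumed hypergeometric model."""
    ways = comb(total, draw)
    return {
        k: Fraction(comb(red, k) * comb(total - red, draw - k), ways)
        for k in range(0, min(red, draw) + 1)
    }


def mean_and_variance(dist):
    """Expectation and variance of a discrete distribution given as {value: probability}."""
    mean = sum(k * p for k, p in dist.items())
    variance = sum((k - mean) ** 2 * p for k, p in dist.items())
    return mean, variance


dist = red_count_distribution()
mean, variance = mean_and_variance(dist)
print(f"expected reds per handful: {float(mean):.2f}")
print(f"standard deviation of reds: {float(variance) ** 0.5:.2f}")
```

Under this assumption the expected number of reds is 5 and the standard deviation is about 1.5, which gives one way of judging whether a student's suggested outcomes show too little or too much variation.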
The Weather Task was adapted from a protocol used by Torok and Watson
(2000) and was analyzed for the students in this study by Watson and Kelly
(2005). The protocol was based on the yearly average daily maximum tempera-
ture in Hobart, Tasmania, of 17ºC. Students were asked to explain what this
meant, suggest daily maxima throughout the year, draw a graph to represent the
maximum temperatures throughout the year, and interpret three graphs presented
to them. Initial expectation was expressed in the yearly average of 17ºC, whereas
all questions were directed at variation about this expectation and the yearly trend
in expectation as recognized by the students.
The Comparing Groups protocol employed three of four parts of the protocol
first analyzed in relation to beginning inference by Watson and Moritz (1999) and
later analyzed for explicit features of variation discussed in the responses by
Watson (2001, 2002). Based on the work of Gal and his colleagues (e.g., Gal,
Rothschild, & Wagner, 1989) students were asked to compare three pairs of
graphs, each pair showing test outcomes for two classes of children. Two pairs of
graphs represented sets of the same size, whereas the third pair represented sets
of differing size. Of interest were the methods chosen by students to decide which
classes had done better and the notice taken of variation in the graphs during deci-
sion making. Coding of responses with respect to two rubrics, one for expectation
and one for variation, was reported for a larger data set by Skalicky (2005).
The Spinners Task was adapted from the work of Zawojewski and Shaughnessy
(2000) and Shaughnessy and Ciancetta (2002), and analyzed by Watson and Kelly
(2004a). The task involved two circular spinners, each half black and half white.
The scenario was based on the chances of winning a game where it was necessary
to spin both spinners and have them land on black. Trials of the game were actually
performed with students and they were allowed to change their initial estimates of
the chances of winning. Although observation of difference from students’ stated
expectation was considered in coding, this task mainly addressed expectation
related to outcomes from the two independent spinners.
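The expectation underlying the Spinners Task can be made explicit by listing the sample space. The sketch below assumes that each spinner lands on black or white with equal probability and that the two spins are independent; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Each spinner is half black and half white (equal chances assumed).
SPINNER = ("black", "white")

# All equally likely outcomes for the pair of independent spinners.
outcomes = list(product(SPINNER, repeat=2))

# The game is won only when both spinners land on black.
wins = [pair for pair in outcomes if pair == ("black", "black")]

p_win = Fraction(len(wins), len(outcomes))
print(f"chance of winning: {p_win}")
```

Of the four equally likely outcomes only one wins, so the expected proportion of wins over many games is one quarter, although any particular set of trials is likely to vary about this value.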
The Population/Sample Means Task was adapted from a problem of Tversky
and Kahneman (1971) and was based on the difference in mean values for a random
sample of size 10 drawn from a population and for the corresponding sample of size
9 if one of the values from the original sample is known. In this case the context
was the weight of Grade 5 students from a population with a mean of 30 kg, for a
sample of size 10, where one value was known to be 39 kg (Watson & Kelly, 2006).
Again the initial questions in the protocol were stated in terms of expectation, but
appreciation of sampling variation was instrumental in achieving sophisticated
responses and two rubrics were devised to account for both ideas.
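The expectation side of this task reduces to a short calculation, sketched below under the assumption of independent random sampling from the population. The function name is hypothetical; the numerical values are those given in the task description.

```python
POPULATION_MEAN_KG = 30.0   # mean weight of the Grade 5 population
SAMPLE_SIZE = 10            # size of the random sample
KNOWN_WEIGHT_KG = 39.0      # the one sample value that is known


def expected_sample_mean(population_mean, sample_size, known_values):
    """Expected mean of a random sample when some of its values are already known.

    Each unknown value is expected to equal the population mean.
    """
    unknown_count = sample_size - len(known_values)
    return (sum(known_values) + unknown_count * population_mean) / sample_size


# The nine remaining students: nothing is known, so the expectation is 30 kg.
remaining_nine = expected_sample_mean(POPULATION_MEAN_KG, SAMPLE_SIZE - 1, [])

# All ten students, including the one known to weigh 39 kg.
all_ten = expected_sample_mean(POPULATION_MEAN_KG, SAMPLE_SIZE, [KNOWN_WEIGHT_KG])

print(f"expected mean of the remaining 9: {remaining_nine:.1f} kg")
print(f"expected mean of all 10: {all_ten:.1f} kg")
```

The known value of 39 kg does not change the expectation for the other nine students, but it raises the expected mean of the full sample of 10 above the population mean.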
The task associated with explanation of words related to variation was devel-
oped by the researchers and was analyzed in a similar fashion to that used by
Watson and Kelly (2003c). It included a hint associated with interpreting the sen-
tence, “The winds are variable.” All of the tasks except the last began with a ques-
tion concerning an expectation based on the context, but how variation was
observed and used in decision making or in the creation of representations
was a feature of the analysis.
The Prep students only responded to the Lollies Task and the Weather Task.
As a result, they contribute to only five items in the subsequent analysis.
Initial Analysis
The initial analysis of the responses to each of the protocols was informed in two
ways: by the structural taxonomy suggested by Biggs and Collis (1982, 1991) and
by the statistical appropriateness of the responses. The work of Biggs and Collis
is in the Piagetian tradition (Inhelder & Piaget, 1958), reflecting the development
of understanding observed in children as they progress through the school years.
The taxonomy presents four levels of interaction with the relevant elements of the
task presented: (a) at the prestructural or iconic level, responses do not employ
elements of the task and are likely to involve idiosyncratic reasoning; (b) at the
unistructural level, responses employ single elements and are likely not to be
aware of contradictory information; (c) at the multistructural level, responses use
multiple elements, usually in sequence, sometimes recognizing but not being able
to resolve conflicting information; and (d) at the relational level, responses inte-
grate multiple elements of the task to achieve closure, resolving any conflict
encountered. As well as being informed by this structure, the appropriateness of
the elements and their combination was important in coding. Although at times a
response could not be deemed “correct” or “incorrect” given the open-ended
nature of a question, it could be said that it was more or less statistically appropri-
ate given the nature of the task presented. The criteria for appropriateness are
described in detail in Appendix A.
The coding schemes for the tasks were developed and detailed in previous
studies (Kelly & Watson, 2002; Skalicky, 2005; Watson & Kelly, 2003c, 2004a,
2005, 2006). Codes for the Lollies Task were revised from those of Kelly and
Watson (2002) by the first and third authors according to criteria noted in the pre-
vious paragraph, and applied by the two authors independently to the response
set. Any discrepancies were discussed and resolved in a fashion consistent with
the suggestions of Miles and Huberman (1994, p. 61). Appendix A contains the
coding schemes and the frequency with which each code was observed in the
overall sample for the 11 items defined based on the protocols. Table 1 outlines
the components of the protocols presented, the labels applied, the criteria for cod-
ing, and the range of coding values possible.
Secondary Analysis
Coded data were analyzed using the Quest computer program (Adams & Khoo,
1996) employing the Partial Credit Model (PCM; Masters, 1982), with the aim of
identifying developmental pathways (Bond & Fox, 2001, Chapter 7). The rela-
tively small sample size (73 students) could lead to somewhat greater measure-
ment errors than seen with large-scale survey data, but the use of interview data
provided opportunity for more accurate coding according to the underlying devel-
opmental model employed, and this, it was felt, would provide information on
likely progression that would outweigh the disadvantage of the small sample size.
The PCM (Masters, 1982) is one of the family of Rasch measurement models.
It makes use of the interaction between persons (in this case the interview sub-
jects) and items (the 11 tasks) to determine the relative positions of all persons
and all items on the same measurement scale. The unit of measure is the logit, the
natural logarithm of the odds of success (Wright & Masters, 1982). The PCM has
been used with interview data from 58 students in relation to Piagetian tasks
(Bond & Bunting, 1995; Bond & Fox, 2001), and is regarded as a useful model
because it allows for a different number of coding steps for each item. In this
study the 11 tasks were coded from 0 to 3, from 0 to 4, or from 0 to 5, determined
through the initial analysis.

TABLE 1
Coding Criteria for Each of the Tasks

| Task | Label | Criteria for Coding | Range |
| --- | --- | --- | --- |
| Lollies (Parts 1–4) | LDN | Expectation and variation shown in discussion | 0–4 |
| Lollies (Part 5) | LGR | Expectation and variation shown in graph | 0–4 |
| Weather (Parts 1a, 1b, 1d) | WDN | Expectation and variation shown in discussion | 0–4 |
| Weather (Parts 1c, 1e, 1f, 1g) | WDT | Consistency of variation shown in suggested temperatures | 0–3 |
| Weather (Parts 1g, 2) | WGR | Expectation and variation shown in graph produced and graph interpretation | 0–4 |
| Comparing Groups | CGX | Expectation in deciding differences between groups | 0–5 |
| Comparing Groups | CGV | Variation in deciding differences between groups | 0–4 |
| Spinners | SPN | Expectation in explaining outcomes | 0–4 |
| Population/Sample Means | PSX | Expectation in suggesting means for two samples | 0–3 |
| Population/Sample Means | PSV | Variation in suggesting means for two samples | 0–4 |
| Definition of Variation | VDF | Appreciation of variation | 0–4 |
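For readers unfamiliar with the PCM, the following minimal sketch shows how the model converts a person's ability and an item's step difficulties, all expressed in logits, into a probability for each code. It uses the standard form of the model (Masters, 1982); the function name and the step difficulties are hypothetical and are not estimates from this study.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Probability of each code 0..m for one item under the Partial Credit Model.

    theta: person ability in logits.
    step_difficulties: difficulty of each step (code k-1 to code k) in logits.
    """
    # Cumulative sums of (theta - delta_k); the sum for code 0 is defined as 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical item coded 0-3, with illustrative step difficulties.
hypothetical_steps = [-1.0, 0.0, 1.5]
probabilities = pcm_probabilities(theta=0.5, step_difficulties=hypothetical_steps)
for code, probability in enumerate(probabilities):
    print(f"code {code}: {probability:.3f}")
```

Because each item supplies its own list of step difficulties, items with different numbers of codes can be placed on the same scale. When an item has a single step whose difficulty equals the person's ability, the two codes are equally likely, which corresponds to odds of success of 1 and a logit of 0.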
Several statistics, produced by the Quest program, are used to evaluate the fit of
the data to the PCM. The first of these is the Infit Mean Square (IMSQ), a weighted
measure of the extent to which the fit of the items (item IMSQ) or persons (case
IMSQ) deviates from the expected value of 1.00. Acceptable values lie between 0.77
and 1.30 (Adams & Khoo, 1996; Keeves & Alagumalai, 1999). For both items (item
IMSQ = 1.01) and persons (case IMSQ = 1.00) the overall mean values in this study
were acceptable. Individual item fit was also considered. Only two items showed
small misfit: CGV had some indication of random behavior (IMSQ = 1.39) and LDN
behaved unexpectedly consistently (IMSQ = 0.71). Complete item difficulties, mea-
surement error, and fit values are provided in Appendix B. The Separation Reliability
is a measure of how well the items (RI) or persons (RP) behave consistently. These
statistics may be interpreted as reliability statistics and have an ideal value of 1. For
both items (RI = 0.90) and persons (RP = 0.90), the figures were high, indicating that
the behaviors of both items and persons were consistent.
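To make the fit criterion concrete, the sketch below shows one common definition of an infit mean square from the Rasch literature (the sum of squared residuals divided by the sum of model variances) and checks the values reported above against the 0.77 to 1.30 range. It is a simplified illustration and not a reproduction of the Quest computation; the function names are hypothetical, and in practice the expected scores and variances would come from the fitted PCM.

```python
ACCEPT_LOW, ACCEPT_HIGH = 0.77, 1.30  # acceptable range cited in the text


def infit_mean_square(observed, expected, variances):
    """Information-weighted mean square: squared residuals over model variances."""
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variances)


def is_acceptable(imsq, low=ACCEPT_LOW, high=ACCEPT_HIGH):
    """True when the infit mean square lies inside the accepted range."""
    return low <= imsq <= high


# Infit values reported in the text.
reported = {"item mean": 1.01, "case mean": 1.00, "CGV": 1.39, "LDN": 0.71}
for label, value in reported.items():
    status = "acceptable" if is_acceptable(value) else "outside range"
    print(f"{label}: {value:.2f} ({status})")
```

Values above the range indicate more variation in responses than the model predicts, and values below it indicate less, which matches the descriptions given for CGV and LDN.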
The Quest program produced a variable map of the behavior of items and
students, which was interpreted by a qualitative analysis of the skills, knowledge,
and understanding required to respond to the particular items that were clustered
close together on the variable. Item clusters were initially identified by inspecting
the variable map and noting places along the variable where there was an appar-
ent “jump” in difficulty, shown by a gap or discontinuity among the item difficul-
ties. The items occurring in each cluster were then analyzed to distinguish
common cognitive demands, based on the item coding from the initial analysis.
Finally, a short descriptor of each cluster was synthesized. Discussion and agree-
ment among the authors determined the placement of lines on the variable map to
indicate different levels of cognitive demand along the variable. This procedure is
the same as that described in other studies (Callingham & Watson, 2004, 2005). It
should be noted that the lines between the levels are not considered as “hard”
boundaries. Error of measurement means that it is not possible to draw firm divi-
sions between items and that there is potential overlap among items occurring at
the margins. Rather, the levels are a convenient device for describing consistent
behaviors across a range of tasks at different points along a continuum, and thus
provide useful information about likely patterns of development among children.
This approach has been used elsewhere to provide a profile of students’ likely
development (Griffin, 1990). The characteristics of the item clusters appearing at
each level are described in the Results section.
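The first step of this procedure, locating apparent jumps in difficulty, can be sketched as follows. The sketch only lists the widest gaps between adjacent item thresholds; it does not replace the qualitative analysis and discussion through which the level boundaries were actually set. The function name, item labels, and threshold values are hypothetical and are not taken from the variable map.

```python
def largest_gaps(thresholds, n_gaps):
    """Return the n_gaps widest gaps between adjacent thresholds.

    thresholds: mapping of item-code label to difficulty in logits.
    Each result is (gap, label_below, label_above), widest gap first.
    """
    ordered = sorted(thresholds.items(), key=lambda pair: pair[1])
    gaps = [
        (round(upper[1] - lower[1], 2), lower[0], upper[0])
        for lower, upper in zip(ordered, ordered[1:])
    ]
    return sorted(gaps, reverse=True)[:n_gaps]


# Hypothetical thresholds in logits, for illustration only.
hypothetical_thresholds = {
    "item_a.1": -3.0,
    "item_b.1": -2.8,
    "item_a.2": -1.9,
    "item_b.2": -1.7,
    "item_c.1": -1.2,
}

candidate_boundaries = largest_gaps(hypothetical_thresholds, n_gaps=2)
for gap, below, above in candidate_boundaries:
    print(f"gap of {gap} logits between {below} and {above}")
```

Gaps found in this way are only candidates for boundaries, since measurement error means that items near a boundary may overlap.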
To illustrate the typical performance of students at the various levels, kidmaps
(Adams & Khoo, 1996) are presented. These show the most likely placement of indi-
viduals with respect to the items and how they performed in terms of what would be
expected from the difficulty of the items. The dotted lines across the maps indicate
one standard error of measurement. In general, for items falling within this range, the
individuals have approximately a 50% chance of success. For categories below this
region, the chances are higher than 50%, whereas above this region the chances are
less than 50%. Anomalies of performance shown in the upper left quadrant are those
where students achieved on an item at a level higher than might have been expected.
Conversely, those items shown in the lower right of the map are those where students
did not achieve what would have been expected. Overall, the kidmaps show the con-
sistency of students’ responses in relation to the underlying variable.
RESULTS
A Hierarchical Progression of Understanding

The fit of all items to the model (see Appendix B) indicated that the interview
tasks, coded according to an underlying developmental model, worked together
consistently to define a single hierarchical variable. The interpretation of that
variable is now considered.
Figure 1 shows the variable map for students’ statistical understanding show-
ing the placement of the 73 students relative to the interview items to which they
responded. The items on the right side of the variable map in Figure 1 form clus-
ters created by the application of the PCM to the coding used. Boundary lines
were drawn between clusters of items as described in the previous section. The
difference in difficulty at the boundary between Level 1 and Level 2 was 0.90 log-
its, 0.41 logits from Level 2 to Level 3, 0.51 logits from Level 3 to Level 4, 0.20
logits from Level 4 to Level 5, and 0.27 logits from Level 5 to Level 6.
Although the coding was hierarchical, no task had six codes that could be
expected to correspond to the six levels suggested for the variable map. Noted
with interest are the increasing demands of higher codes and how they reflect the
cognitive requirements of other item codes appearing at the same level. The brief
descriptors in Figure 1 summarize the increasing acknowledgement of and facility
with expectation and variation shown in the interview responses. These are ampli-
fied and explained in the following paragraphs. These descriptions are illustrated
by kidmaps that show typical students’ responses within a particular level. Of
importance in considering the description of response categories for the tasks
along the variable map are the contexts within which the tasks were set where
students could become involved and apply their prior experience with expectation
and variation. The desire to consider various contexts within which variation
occurs was an overriding interest, and, as is seen, context appears to affect the dif-
ficulties of the items. Unless stated otherwise, examples and representations pre-
sented at any level are from different students.
FIGURE 1 Variable map for the conceptual understanding of expectation and variation.
Description of Levels
Level 1, Idiosyncratic. For most of the response categories appearing at
Level 1, iconic reasoning that was not related to issues involving expectation or
variation was likely to be displayed. For the Lollies protocol, students were
likely to explain outcomes (LDN.1) in terms of their favorite numbers, of the
position of lollies in the container, or of the sizes of their hands. Similarly for the
Spinners Task (SPN.1), explanations for observed outcomes from trials were
likely to be based on egocentric or anthropomorphic beliefs, for example, sug-
gesting “nine” black outcomes “because I’m turning nine this year.” For the
Weather protocol, explanations (WDN.1) were likely to be inconsistent across
parts, perhaps suggesting alternatives to the maximum or noting a single aspect
of the weather context but also focusing on personal experiences of cold weather
and choosing what clothes to wear. Graphs for the lowest response category for
the Weather protocol (WGR.1) consisted only of informal axes with no data or of
pictures of sun and rain; as well, there was an inability to interpret other graphs
shown in the protocol except for occasionally noting single values in one of
them. Figure 2 shows two examples of representations for the Weather Task
(WGR.1) at Level 1. For the initial response category for comparing graphs of
two data sets (CGX.1), single features were likely to be used to distinguish the
better set, for example, noticing placement along the number line for the Blue
and Red classes or the existence of a “7” for the Brown class (see Appendix A).
Responses in the lowest six response categories shown on the variable map showed
an appreciation of what the tasks were about but went little further in explanation
or representation.
The kidmap in Figure 3 shows the performance of a Prep student, S1, with an
ability estimate of –3.35 logits (cf. Figure 1), placing the student in the middle
range of Level 1. For the four response categories in the range where the odds of
success are 50–50, the student was successful on 3 of 4 items, reflecting idiosyn-
cratic explanations of variation and inconsistent suggestions of temperature data,
shown by the following excerpts from the interview.
S1 (LDN): [1(a) How many reds?] 5, because 5 + 5 = 10, 1 more makes 6,
and 4 is 10. [1(b) Same every time?] No. The red up the top will
be gone. [1(c) Surprise?] 10. You might get other colors.
[Six trials] 4, 2, 5, 4, 1, 3 (low values, reasonable spread). If you
shake them you might not get the same amount.
S1 (WDN): [1(a) Tell about the weather?] They might be wrong. All the
time the news is wrong with the weather. [1(b) All days 17ºC?]
Sometimes. Sometimes it is raining.
S1 (WDT): [1(c), 1(d) Suggested temperatures] 10, 9, 8, 7, 6, 5, because they
are the highest.
FIGURE 2 Two representations at Level 1 for the Weather Task (WGR.1).
FIGURE 3 Kidmap for student S1 at Level 1.
The student did not achieve at the higher levels on any task attempted, as would be
expected from the student’s placement on the scale (as noted earlier, the seven Prep
students only responded to tasks LDN, LGR, WDN, WDT, and WGR). As shown in
Figure 2, S1 drew a picture of herself in the sun and explained what she would wear
if it were hot or cold for WGR. She also did not achieve a Code 1 response to LGR,
and her relatively poor performance in graphing tasks may reflect lack of prior
experience. In her responses, S1 typifies the behaviors expected within Level 1.
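The 50–50 reading of the logit scale used in the kidmaps can be illustrated with a minimal sketch of the partial credit model in its standard form. This is an illustration under stated assumptions, not the computation performed by the analysis software: the function name and all step difficulties are hypothetical, and only the ability estimate of –3.35 logits is taken from the text.

```python
import math

def pcm_probabilities(theta, step_difficulties):
    """Category probabilities P(X = 0..m) under the standard partial credit model.

    theta: person ability in logits.
    step_difficulties: hypothetical step parameters (logits), one per step.
    """
    # Cumulative sums of (theta - delta_k); the empty sum for category 0 is 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# A hypothetical one-step item located exactly at S1's ability estimate:
# the two categories are equally likely, i.e., the odds of success are 50-50.
even_odds = pcm_probabilities(-3.35, [-3.35])  # -> [0.5, 0.5]

# A hypothetical two-step item seen by a person at 0 logits.
three_categories = pcm_probabilities(0.0, [-1.0, 1.0])
```

For an item with a single step, ability equal to the step difficulty gives a probability of exactly one half; as ability rises above that point, the higher category becomes the more likely outcome.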
Level 2, Informal. At Level 2, response categories appeared to represent the
beginning of thought about context in relation to expectation and variation but no
indication that both might be present in that context. In comparing graphs of two
data sets (CGX.2), responses comprised a series of steps involving visual comparison
or totals to reach an appropriate decision for sets of equal size. Variation present
in the graphs was beginning to be considered in the decision-making process
(CGV.1), but in only one pair of graphs (Part (b) or (c)) with mention of single
columns or vague use of a phrase such as “more.” In attempting to explain or
define Variation (VDF.1), although students often claimed they had heard of the
term, responses were unlikely to be related to the concept. When asked for a way
to show the result of many repeated trials of drawing lollies (LGR.1), students
were likely to draw pictures, display single numbers, or attempt a form of graph
that was not related to the context of the task (see Figure 4).
FIGURE 4 Two representations at Level 2 for the Lollies Task (LGR.1).
For the second response code for graphing in the Weather protocol (WGR.2),
graphs were likely to be incomplete monthly representations or isolated temperatures
with no shape, and descriptions of the other graphs were likely to involve
single ideas, sometimes vague, or misinterpretations. Two graphs from this response
category are shown in Figure 5. These students were likely to know what the
general shape of a graph should be but were unable to connect this with the
requirements of the task to show change over the year.
The kidmap in Figure 6 is from a Grade 3 student, S2, with an ability estimate
of –1.54 logits. This student was not asked about definitions of variation but else-
where provided Code 1 responses except for Comparing Groups (CGX.2).
Variation was acknowledged in the Spinners Task (SPN), but the explanation
given was anthropomorphic.
S2 (SPN): [Agree with Jeff 50–50 chance?] No. Because they might not land on
the same place because they don’t know whereabouts they are going
to land.
In the Comparing Groups Task coded for expectation (CGX), the response
focused on the class total only, using a stepwise calculation.
S2 (CGX): [Compare Yellow and Brown graphs. Which is better?] Brown,
because 3 plus 4, plus 4, plus 5 is 16, plus another 5 is 21, plus 5 is
26, … 32, 38, 45. [Yellow?] That’s 12, 5 plus 5, 15, … 25, … I
reckon they could be equal.
Graphing tasks produced the representation shown on the left of Figure 4 (LGR)
and the following limited response to WGR.
S2 (WGR): Writes Jan, Feb, Mar. Produces a list for Jan: 18, 9, 10, 22, 13, 18.
The student did not achieve a Code 2 on the Weather graphing task, but this result
is not unexpected, as the item falls within the range where there is a 50% chance
of success. This student was at the top of Level 2, possibly in transition to more
sophisticated thinking, in keeping with the idea that the boundaries identified are
not “hard” barriers.
FIGURE 5 Two representations at Level 2 for the Weather Task (WGR.2).

Level 3, Inconsistent. Two salient features of responses were observed at
Level 3. One feature was that responses at this level tended to be inconsistent but
the students did not recognize that this was happening, for example, giving ranges
for data values that were inconsistent with individual values suggested (WDT.2).
“Anything can happen” and the physical working of a spinner were two common
reasons given for expected outcomes, often couched in phrases such as “50–50”
with no specific interpretation (SPN.2). At this level, two response categories for
defining variation terms appeared (VDF.2 and VDF.3), indicating a transition from
students being unlikely to make any progress on defining terms associated with
variation unless provided with support in a sentence such as “the winds are
variable” (VDF.2) to being likely to suggest one meaning without prompting and
another with help (VDF.3). The other major characteristic at this level was the
relatively consistent use of single features to describe expectation and variation. This
was seen, for example, in a focus on “more” in explaining why red lollies would
occur in samples from a population that was 50% red (LDN.2). Similarly in com-
paring two data sets, single features such as columns or a bulge in a graph meaning
“more” were likely to be seen as representing a class doing better (CGV.2). This
was also seen when some students produced a time-series type of graph showing
variation from student to student (LGR.2), or date to date (WGR.3), but with little
recognition of a need to aggregate or reflect the middle or trend of a data set.
Examples of graphs for each task are given in Figure 7. For the graphing of
weather data, some students at this level could not draw a representation like
that shown in Figure 7, but could interpret other representations appropriately.
Typical of the consistent students at Level 3 was the Grade 5 student, S3, with an
ability estimate of –0.72 logits, whose kidmap is shown in Figure 8. This student
achieved Code 2 responses for six tasks but did not achieve one Code 2 response in
the region of 50% chance (LDN.2). The response to Comparing Groups (CGX,
CGV) indicates a more holistic view of the two graphs than does the response of the
previous student, S2, but, typical of this level, it focused on “more.”
FIGURE 7 Representations at Level 3 for Lollies task (LGR.2) and Weather Task (WGR.3).
S3 (CGX, CGV): [Yellow or Brown class better?] Yellow [How did you
decide?] Because there’s a whole lot of 5s (Yellow).
There’s two 6s and two 4s. And this one (Brown) has
only got 10 [points to the 3 and the 7] … [Pink or Black
class better?] Pink, because more of the bars are higher.
A single focus on the language of variation is shown in the response to VDF.
S3 (VDF): [Have you heard “the winds are variable”?] Yes. [What does it
mean?] It is changing or something.
The anomalous outcomes for S3 were for the items on populations and samples
(PSX and PSV) where Code 1 responses were not achieved. As noted earlier,
these items were generally more difficult for students, but S3’s responses may
also reflect a lack of opportunity to learn about these ideas in any formal sense in
the primary years of schooling, as shown in the following extract.
S3 (PSX, PSV): [Next 9 children, average weight?] 189. [How did you work
it out?] Because I was trying to use the 30 and 39. [The
whole sample of 10 children?] [Pause] I don’t know. [Do
you want to use a calculator?] Yes, writes down 4287.
[How?] First I started with 189 plus 39 …
Level 4, Consistent. Two of the tasks had their highest codes appearing at
Level 4: WDT and VDF. Students were likely to recognize the need for consis-
tency in suggesting ranges in relation to data values (WDT.3) and, when specifi-
cally asked for explanations of words associated with variation, generally
provided satisfactory responses for all terms (VDF.4). They also usually com-
pared graphs of data sets of the same size successfully (CGX.3). In explaining
aspects of variation for drawing lollies from a container (LDN.3), responses
tended to refer to “more” or “half” with some appreciation of center but without a
strong appreciation of proportion across related tasks. Similarly explanations of
variation in temperatures (WDN.3) focused on comparisons between sites, alter-
natives to a maximum, or multiple aspects of weather events without focusing
explicitly on center or distribution. For graphing of repeated outcomes from
drawing lollies from a container (LGR.3), responses were likely to be a time-
series type focusing realistically on the center or to be a frequency type without
specific reference to an appropriate center. Two graphs for the Lollies Task repre-
sentative of this level are shown in Figure 9.
Among the most consistent performances at Level 4 was that of the Grade 7
student, S4, with an ability estimate of 0.86 logits, who had no anomalous results.
The kidmap is shown in Figure 10. The student focused on center in predicting
lollies outcomes but also acknowledged variation (LDN.3).
S4 (LDN): [1(a) How many red in 10?] 5 [Why?] Because there are 50 in
there and the other two, 20, 30, that equals another 50, and that’s
100 and that’s the majority of them, so you might get 5. [2(a) Six
expected outcomes] 5, 4, 6, 5, 7, 6 (centered values, reasonable
spread) [Why?] Well most of them are around 5 and there’s 50 in
there [container] and 20, you would get 5 red. [3(a) Multiple
choice]. (c) 5, 5, 5, 5, 5, 5 (after eliminating the other options).
[4(a) Range] 3 to 8.
FIGURE 9 Two representations at Level 4 for the Lollies Task (LGR.3).
The student was consistent in giving temperature values (WDT.3) but did not achieve
a Code 3 response for explaining the average temperature in Hobart (WDN).
S4 (WDN): [1(a) What does 17ºC mean?] It is cold. [Anything else?] It is not
really a hot place, it is a more of a lower temperature place to live
in. [1(b) All days 17ºC?] No. [Why?] Well on summer days …
this year we had a couple up to 30… and in the winter it has been
cold like 7 … and 12. [1(c) Suggested temperatures] 23, 31, 13,
19, 29, 27 [1(d) Explain choice of 6 temperatures] An average day
… not a warm day, just an in-between day, bit of a cold day.
On the Comparing Groups Task (CGV), S4 achieved a Code 2 response, but did
not reach the higher level Code 3 that fell within the 50%-chance zone.
S4 (CGV): [Yellow or Brown class better?] Exactly the same (added scores).
[Pink or Black class better?] Pink. They had more 6s, had more
5s, more 4s, more 3s.
Level 5, Distributional. At Level 5 students were usually successful in
relating expectation and variation in a context involving a single data set and
often included discussion of variation in their responses with no prompting. In
doing so, explanations were likely to mention variation about a center as well as
ideas such as range. Five tasks, those for the Lollies, Weather, and Spinners proto-
cols, had their highest codes at Level 5. For the Lollies protocol (LDN.4), the
explanation was likely to include mention of shape (although the term distribution
was seldom used) with focus on proportional reasoning. Similarly for the Weather
protocol (WDN.4), responses generally focused explicitly on variation away from
the average maximum temperature. In the graphing task for the Lollies protocol
(LGR.4), graphs showing the appropriate shape of the relevant distribution were
likely to be drawn, although often with too much variation. Two examples are
shown in Figure 11. For the Weather protocol (WGR.4), graphs showed the
appropriate shape and variation throughout the year and the other graphs were
described correctly in terms of meaning and variation shown. Two graphs are
shown in Figure 12.
For the Spinners Task two codes appeared at this level, Code 3 and Code 4.
Students appeared to learn from the trialing of the spinners after initially suggest-
ing 50–50 outcomes, and some students were able to use this prompt to reach the
higher level of response (SPN.4). For others, however, responses were unlikely
to be quantitatively appropriate (SPN.3). In comparing graphs of two data sets
(CGX), responses were likely to be successful in determining the better
groups when the groups were of unequal size, using a single feature of the graphs,
either the mean or the visual proportional aspect but not both. Generally at Level 5,
variation was appreciated in various contexts and this was stated appropriately in
relation to proportions (e.g., Lollies and Spinners protocols) and to averages (e.g.,
Weather protocol) and to an inference with a single measure (e.g., Comparing
Groups protocol). The intuitive dilemma of reconciling expectation and variation
was likely to be resolved for tasks based in straightforward situations.
FIGURE 11 Two representations at Level 5 for the Lollies task (LGR.4).
FIGURE 12 Two representations at Level 5 for the Weather task (WGR.4).
At Level 5, a Grade 7 student, S5, with an ability estimate of 1.95 logits,
reached the highest response category for four of the tasks (LDN.4, LGR.4,
CGV.4, VDF.4), but performed unexpectedly poorly on suggesting consistent
temperatures (WDT). This discrepancy is difficult to explain but may relate to a
lack of interest or familiarity with the weather topic. The kidmap for this student’s
responses is shown in Figure 13. The response to the Lollies Task (LDN) showed
the student’s understanding of expectation and variation.
S5 (LDN): [1(a) How many reds in 10?] Probably about 5. [Why?] … If you
choose 10, we have … half of the 100 is red. So I expect if you
pulled out 10, half of that many … would be red. [2(a) Six
repeated outcomes] 4, 5, 6, 3, 4, 6 (centered values, reasonable
spread) [Why?] Because 4, 5, 6, are around the halfway mark. So
I thought maybe once they might get less than what you would
expect so I put 3. [3(a) Multiple choice] (b) 3, 7, 5, 8, 5, 4
[Why?] … because (b) is more the average. You have got a couple
above, a couple below, and a couple of times the same.
The response to the Weather Task (WDT), however, showed little appreciation of
the context, and did not take account of the information provided about the aver-
age temperature.
S5 (WDT): [1(c), 1(d) Suggested temperatures] 31, 29, 27, 30, 26, 25, to
give a wide range of the possibilities because quite often you
have a very cold day but then you have very hot days and so the
rest are just spread out through the middle to show that they are
all different and you can get different temperatures. [1(e)
Highest and lowest maximums for the whole year] 32 and 25.
[1(f) Highest and lowest maximums for January] 17 and 10.
[1(g) Highest and lowest maximums for July] 23 and 20.
In contrast, the response to the Comparing Groups Task showed sophisticated
understanding recognizing the different influences of expectation and variation.
S5 (CGX, CGV): [Yellow or Brown class better?] I would have a look at the
scores. We have got 1, 2, 3, 4, 5 (Yellow). These guys
(Brown) have got less people getting the average but have
got more variety. Although they (Brown) have got less than
the Yellow class, the lowest person got less than the Yellow
class but they have also got a higher rate (Brown). So I
reckon they are about equal but … if I just look at it like
that I reckon they are around about equal but I would have
to see to be exact. … These people (Brown) got 3 too, so I
would just take that part out (Yellow) and then say these
people (Brown) got exact from there (Yellow) so I would
count this and this and these people (left, right, middle,
Brown) and those two (Yellow 5s) which is 10 so they got
equal. [Pink or Black class better?] I would have a look at
the highest and lowest scores. So the highest being 9
(Pink), the highest being 9 (Black). Four people in the sec-
ond highest which was 8 (Pink), four people here (Black).
Their third highest (Black) six people getting 7, same there
(Pink). Their lowest got 2 and 2 (Pink/Black), and then this
is probably where the more people come in to count (Pink)
because … more people got 3, more people got 4, more
people got 5, more people got 6. But I would have to average
it out because they (Pink) have got more people.
Level 6, Comparative distributional. The four items at this level are based
on two protocols, representing the highest codes for the tasks, which each
required comparing and contrasting of two data sets (two graphs of data or two
samples). For Comparing Groups the two items required both visual comparison
and the use of means in comparing two groups (CGX), as well as an integrated
comparison of global features of variation for the two groups (CGV). The other
two items, for Population/Sample Means, required an understanding of the
sample mean as a representation of the population mean in the two sample sizes
(PSX), as well as the explanation of multiple aspects of potential variation associ-
ated with the values in the two samples (PSV).
Only one student performed at Level 6. This student, S6, with an ability estimate
of 4.66 logits, was unsuccessful only at the highest response category on the
explanation of variation in the weather (WDN.4). This Grade 7 student’s kidmap is
shown in Figure 14. Of particular interest are responses to the Population/Sample
Task, where the student recognized and reconciled the expected value (average)
with likely variation.
S6 (PSX, PSV): [The next 9 children, average weight?] Umm, about, around
29, 30. [Why?] That’s just an average, it could be anything
but because the first one’s 39, I wouldn’t expect it to be more
than 30 but the average weight would be well over that one
because it is just a small sample. Yes, like they could be a lot
lighter, so could be, yes, could be anything but it would be
around there as an average. [Sample of 10?] Around 31, 32.
[Why?] Because the 39 you always know—you first know
that one’s a bit heavier than the average already, so if they’re
on average he would probably be about 31 all up … just a bit
heavier.
Similarly, on Comparing Groups (CGX, CGV) the student was able to take a
global perspective, using all the information.
S6 (CGX, CGV): [Yellow or Brown class better?] They were about even
because they (Yellow) had more on 5 and they (Yellow)
didn’t have any 3s, but they (Brown) had a 7 and they
(Yellow) didn’t have a 7, so it is pretty much exactly even.
[Pink or Black class better?] The black class scored a bit
better because they got more higher but there’s many more
students in this (Pink) so on an average they (Pink) would
be about 5 or 6. On an average they (Black) would be about
a 6 so they would be around even. [How would you find
out?] Average all the scores up to get it to an average score.
Add all the scores together and divide it by how many.
Overall the kidmaps demonstrate the range of individual performance
observed. The students whose kidmaps are shown responded in ways consistent
with the general description of the variable, but with some unique differences,
suggesting that different developmental pathways are possible.
The level reached by each student was determined on the basis of where the
logit value of the student’s ability estimate fell along the continuum. The students
overall showed some improvement in level of performance by grade up to Grade
7, but the overlap of achievement is marked. As shown in Table 2, all Prep
students appear at Level 1 or 2, the majority of Grade 3 students are at Level 3,
Grade 5 students at Level 3 or 4, and Grades 7 and 9 students at Level 4 or 5. This
distribution reflects the difficulty of consistently reaching the highest category of
response in all tasks for most students.
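The distribution described above can be checked directly from the counts reported in Table 2. The following sketch simply transcribes those counts and recomputes the totals and the most common level for each grade; the variable and function names are illustrative assumptions, not part of the original analysis.

```python
# Counts transcribed from Table 2 (rows: levels; columns: grades).
GRADES = ["Prep", "Grade 3", "Grade 5", "Grade 7", "Grade 9"]
TABLE_2 = {
    "Level 1": [4, 2, 0, 0, 0],
    "Level 2": [3, 3, 1, 0, 0],
    "Level 3": [0, 10, 11, 2, 1],
    "Level 4": [0, 3, 6, 9, 10],
    "Level 5": [0, 0, 0, 3, 4],
    "Level 6": [0, 0, 0, 1, 0],
}

def row_totals(table):
    """Number of students at each level, summed over grades."""
    return {level: sum(counts) for level, counts in table.items()}

def column_totals(table, grades):
    """Number of students in each grade, summed over levels."""
    return {g: sum(counts[i] for counts in table.values())
            for i, g in enumerate(grades)}

def modal_level(table, grades, grade):
    """Level containing the most students from the given grade."""
    i = grades.index(grade)
    return max(table, key=lambda level: table[level][i])

levels_total = row_totals(TABLE_2)
grades_total = column_totals(TABLE_2, GRADES)
grand_total = sum(levels_total.values())
```

The row totals reproduce the Total column of Table 2 and sum to the 73 students interviewed, and the modal level moves from Level 1 for Prep to Level 3 for Grades 3 and 5 and Level 4 for Grades 7 and 9.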
DISCUSSION
Following the work of Bond and Bunting (1995) using Piaget’s pendulum task,
this study adds to the evidence supporting the use of Rasch analysis in develop-
mental research.
The use of Rasch modeling in the Bond and Bunting (1995) research showed
the value that the concept of order has within a framework of unidimensionality.
Interpretations of item and item step order as well as person order are
clearly central in developmental and educational research, with clear implications
for measuring physical skill acquisition and medical rehabilitation as well.
Hand in hand with a clear concept of the variable under examination is the
Rasch concept of unidimensionality. Although this might seem a little esoteric
to some, the point is an important one in the application to Rasch measurement,
especially in novel settings. (Bond & Fox, 2001, p. 103)
TABLE 2
Levels of Performance Across Grades
          Prep   Grade 3   Grade 5   Grade 7   Grade 9   Total
Level 1     4       2         0         0         0         6
Level 2     3       3         1         0         0         7
Level 3     0      10        11         2         1        24
Level 4     0       3         6         9        10        28
Level 5     0       0         0         3         4         7
Level 6     0       0         0         1         0         1
There are three factors that can influence the error estimates and, hence, confi-
dence in the outcomes of a partial credit analysis (Bond & Fox, 2001, p. 100).
These factors are (a) a small sample size, (b) the item difficulties being off target
for the population and showing a ceiling or floor effect, and (c) a large number of
response codes per item that can lead to poor discrimination between categories.
Although the sample size and number of items in this study were not large, the
overall fit for the 11 items was good (see Appendix B), with only some indication
of random behavior for CGV and of unexpectedly consistent performance for
LDN. Errors of measurement (shown in Appendix B) were not unduly large,
despite the small sample size, and in keeping with the size of the errors in the
Bond and Bunting study. The variable map (Figure 1) shows a reasonable distrib-
ution of items and students. There is a slight “tail” in the students’ distribution,
but there are items at low levels that did allow these students to demonstrate what
they were able to do. Similarly, at the other end of the scale, there is a cluster of
items that demanded high levels of response. Item and case (person) separation
reliabilities were high, indicating that the tasks provided a wide-ranging variable
and that the students were spread out along it. Overall, based on the three criteria
of Bond and Fox, there can be confidence in the outcomes from the analysis.
Variable Interpretation
The decisions on the placement of boundaries in the variable map were based on
qualitative interpretations of the demands of the tasks and the jumps in difficulty
between adjacent item clusters. This approach was one way of segmenting the con-
tinuum to provide a usable description for teachers that was not unduly detailed, but
inevitably it meant that there were some compromises. Generally, for each individ-
ual item the increasing codes appeared in different identified levels of the variable,
the exceptions being PSV.2 and PSV.3, and SPN.3 and SPN.4, which appeared in
Level 5, and VDF.2 and VDF.3, appearing in Level 3. The appearance of both codes
within the same level suggests that there is a relatively small jump in understanding
between the two codes. In particular, PSV.2 focused on “balancing” alone, whereas
PSV.3 recognized a wider range of sources of variation in the task. It is likely that,
once students begin to balance possible outcomes, they are demonstrating early
understandings of the links between expectation and variation, providing a basis for
teacher intervention. In the Spinners Task, given that the spinners presented were
50–50 spinners, it may be that it was relatively easy for students to achieve a
theoretical solution (Code 4). From the responses to VDF, it would seem that being able
to demonstrate appropriate understanding of variation only when offered a familiar
context (Code 2) can be a scaffold to the higher level response of appropriate
description (Code 3). In common measurement practice, codes that are close
together along the variable would be collapsed. It is the opinion of the authors,
however, that the additional information gathered by including all codes can provide
additional insight into students’ development of the concepts underlying the
unidimensional variable.
The Student Sample

Across the grades, the choice of students to be interviewed may have reflected
students with higher ability, except for Grade 9 where the teachers suggested that
the classes chosen for participation in the study were of “average” ability.
Although this may have resulted in fewer students performing at Levels 5 and 6,
the range of performance for students with more exposure to the mathematics
curriculum suggests that teachers of these grades must take into account a wide
range of students’ ability in their teaching methods. The relatively small sample
size also means that the distribution of students across levels by grade should be
viewed as tentative, subject to further research.
Hypothesized Developmental Progression

The Rasch PCM analysis of students’ responses to interview tasks coded to reflect
increasing structural complexity provided a unidimensional variable, interpreted
in terms of students’ development of ideas of expectation and variation. The inter-
view outcomes could be described in terms of increasing appreciation of variation
and expectation with six distinctly identified levels that provide a means of
describing development. The range of responses observed suggests the possibility
of a developmental progression for students in Grades 3 to 9. Longitudinal
interviews have supported this conjecture (Watson, 2001; Watson & Moritz, 2000,
2003). The overall development of ideas within chance and data contexts appears
to be related to the growing appreciation of expectation and variation, specifically
discussed with respect to particular tasks (e.g., Shaughnessy & Ciancetta, 2002;
Watson & Kelly, 2003b, 2004a, 2004b, 2006) and eventually consolidated when
the interaction of the two becomes understood.
The levels identified in this study are summarized in Figure 15 in terms of a
model of the development of ideas of expectation and variation, and the interac-
tions between them. Expectation is associated with proportions, probabilities,
averages, caused or observed differences, and random distributions. Variation
includes some of the same ideas considered from a different perspective: uncer-
tainty, change, anticipated change, unanticipated change, and random behavior.
FIGURE 15 Suggested pathway for development of understanding of expectation and variation.
As can be seen in Figure 15, it is hypothesized that students begin with little
recognition of either expectation or variation indicated by faint print and broken
lines around the terms at Level 1. Students then appear to take on a primitive
appreciation of the ideas at Level 2 with some single descriptors but no interac-
tion of ideas. At Level 3, “more” becomes a proportional focus. “Anything can
happen,” however, is an explanation in chance settings for outcomes that vary
from those expected as “more.” This explanation is also used in data settings (e.g.,
with temperatures). There is little acknowledgement, however, of links between
the ideas. Gradually students then take on the idea of expectation as center or
trend, depending on the setting, and a more sophisticated idea of variation as
“small change” rather than “any change.” This beginning of appreciation of inter-
action between the two ideas is indicated at Level 4 with a broken arrow. At Level
5 students begin to resolve the dilemma of reconciling expectation and variation
in a single setting by recognizing their connections with each other, indicated
with a solid arrow, and can question surprising outcomes of one with respect to
the other. Finally at Level 6, they can do this when asked to compare and contrast
data sets, indicated by crossing arrows.
It is of interest that this model has similarities to the stages of development
proposed by Piaget and Inhelder (1975). Whereas the earlier work focused on
experiments with different random generators, the study reported here included
more “real world” examples, which allowed for the social context of the questions
to be considered. At Level 6, students could reconcile opposing ideas regardless
of context, whereas at lower levels familiarity with the context in which the items
were based appeared to play a part. The inclusion of context is in keeping with the
thrust towards statistical literacy and reinforces the interaction of context with the
mathematical ideas underlying statistical literacy that has been described in ear-
lier studies based on survey data (Callingham & Watson, 2005; Watson &
Callingham, 2003; Watson et al., 2003).
Educational Implications
It may be deemed impractical in a classroom to carry out interviews with
individual students to uncover the details of understanding, as this study has
done based on the protocols in Appendix A. Some of the questions, however, can
be used to structure discussion and activities in the classroom that will elicit
students’ beliefs and understandings. The protocols can be adapted for group
work and report-writing for assessment purposes. If, as suggested by the analyses
in this paper, the tasks present a viable way of deciphering levels of student
understanding, then their use in the classroom to assist students to reach higher
levels of performance should be encouraged. Watson and Shaughnessy (2004)
made practical suggestions for the use of the Lollies and Comparing Groups pro-
tocols that incorporate the importance of proportional reasoning in relation to the
expectation aspects of the tasks.
Following Steen’s (1988) claim that mathematics is the “science of patterns,”
recent changes to the mathematics curriculum (NCTM, 1989, 2000) suggest
increased work with pattern to enhance students’ appreciation of pattern in
mathematical settings. In practice, however, pattern work is usually related to gener-
alized arithmetic leading to algebra rather than to an appreciation of data in context.
This fact may have implications for teaching in the area of data handling: teachers
should appreciate that pattern as expectation needs to be considered beyond ideas of
“more” or seasonal changes, for example, if students are to develop to the highest
levels of understanding of variation. Explicitly comparing and contrasting the use of
pattern in pre-algebra and expectation in data settings may be useful, as suggested
for example by Watson and Kelly (2003a) in the context of interpreting a pictograph.
Teaching intervention is most likely to be successful when targeted at the point
where students have about a 50% chance of success, that is, in the central area of the
kidmaps. Students have a good understanding of concepts that appear at the lower
levels and have a basis for progression to higher levels, although they have not yet
completely mastered the concepts. Targeting teaching at this level provides for the
consolidation of necessary underpinning knowledge and leads to progression to the
next level. Students who, for example, can pick out “more” consistently, are ready
to learn about trends in data, such as central tendencies. Such an understanding will
need to be explicitly targeted by strategic teaching intervention that allows for
application of ideas about variation within a range of contexts. Although it is
unlikely that most teachers will have access to tools such as kidmaps, matching
students’ responses for these interview tasks (or similar activities used in a class-
room setting) to the descriptions of the levels will provide an indication of the
students’ levels of response sufficient for deciding a direction for further teaching.
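Where ability estimates and item locations are available, the idea of targeting teaching at the region of about 50% chance of success can be sketched as follows. This is a simplified illustration only: it treats each response category as a single location on the logit scale, uses the simple logistic (dichotomous Rasch) success probability, and adopts an arbitrary band of 25% to 75% as "about 50%." The category labels and their locations are hypothetical placeholders, not estimates from this study; only the ability estimate of –0.72 logits (student S3) is taken from the text.

```python
import math

def success_probability(ability, location):
    """Logistic probability of success for ability and location in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - location)))

def teaching_targets(ability, locations, low=0.25, high=0.75):
    """Labels of categories whose success probability lies in [low, high].

    The default band is an assumption chosen for illustration.
    """
    return sorted(
        label
        for label, location in locations.items()
        if low <= success_probability(ability, location) <= high
    )

# Hypothetical category locations in logits (placeholders, not study results).
hypothetical_locations = {
    "category_A": -3.9,
    "category_B": -1.1,
    "category_C": -0.6,
    "category_D": -0.2,
    "category_E": 1.4,
}

# Ability estimate reported for student S3.
targets = teaching_targets(-0.72, hypothetical_locations)
```

With these placeholder values, the categories located close to the student's ability are selected, whereas those far below (already secure) or far above (not yet within reach) are excluded. Teachers without such estimates can obtain a comparable indication by matching responses to the level descriptions, as noted above.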
Follow-up work along these lines has been extensively trialed with classes
from Grade 6 to Advanced Placement Statistics, employing variations on the
Lollies protocol used here (J. M. Shaughnessy, personal communication, 16
August, 2004). Observation of videotaped classroom dialogue and interaction
indicates a range of initial understanding consistent with that described here, as
well as progress in reasoning associated with extensive experimentation of sam-
pling from populations in the classroom. Further research and analysis of these
protocols in different settings (e.g., surveys, interviews, and classrooms) will con-
tribute to confirmation of the hypothesis arising from this study.
ACKNOWLEDGMENTS
This research was funded by the Australian Research Council, Grant numbers
A00000716 and DP0208607. The authors thank the referees for helpful sugges-
tions in revising this article.
REFERENCES
Adams, R. J., & Khoo, S. T. (1996). Quest: Interactive item analysis system. Version 2.1 [Computer
software]. Melbourne: Australian Council for Educational Research.
Australian Education Council. (1991). A national statement on mathematics for Australian schools.
Carlton, Vic: Author.
Biggs, J. B., & Collis, K. F. (1982). Evaluating the quality of learning: The SOLO taxonomy. New York:
Academic Press.
Biggs, J. B., & Collis, K. F. (1991). Multimodal learning and the quality of intelligent behaviour. In
H. A. H. Rowe (Ed.), Intelligence: Reconceptualization and measurement (pp. 57–76). Hillsdale,
NJ: Lawrence Erlbaum Associates, Inc.
Bond, T. G., & Bunting, E. M. (1995). Piaget and measurement III: Reassessing the méthode clinique.
Archives de Psychologie, 63, 231–255.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch Model: Fundamental measurement in the
human sciences. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Callingham, R. A., & Watson, J. M. (2004). A developmental scale of mental computation with part-
whole numbers. Mathematics Education Research Journal, 16(2), 69–86.
Callingham, R., & Watson, J. M. (2005). Measuring statistical literacy. Journal of Applied Measurement,
6(1), 19–47.
Capel, A. D. (1885). Catch questions in arithmetic & mensuration and how to solve them. London:
Joseph Hughes.
Fischbein, E. (1975). The intuitive sources of probabilistic thinking in children. Dordrecht: D. Reidel.
Fischbein, E., & Gazit, A. (1984). Does the teaching of probability improve probabilistic intuitions?
An exploratory research study. Educational Studies in Mathematics, 15, 1–24.
Fischbein, E., Nello, M. S., & Marino, M. S. (1991). Factors affecting probability judgements in
children and adolescents. Educational Studies in Mathematics, 22, 523–549.
Fischbein, E., & Schnarch, D. (1997). The evolution with age of probabilistic, intuitively based
misconceptions. Journal for Research in Mathematics Education, 28, 96–105.
Gal, I., Rothschild, K., & Wagner, D. A. (1989, April). Which group is better?: The development of
statistical reasoning in elementary school children. Paper presented at the meeting of the Society
for Research in Child Development, Kansas City, MO.
Green, D. R. (1983). A survey of probability concepts in 3000 pupils aged 11–16 years. In D. R. Grey,
P. Holmes, V. Barnett, & G. M. Constable (Eds.), Proceedings of the First International Conference
on Teaching Statistics (Vol. 2, pp. 766–783). Sheffield, England: Teaching Statistics Trust.
Green, D. R. (1986). Children’s understanding of randomness: Report of a survey of 1600 children aged
7–11 years. In R. Davidson & J. Swift (Eds.), Proceedings of the Second International Conference on
Teaching Statistics (pp. 287–291). Victoria, BC: The Organizing Committee, ICOTS2.
Green, D. (1991). A longitudinal study of pupils’ probability concepts. In D. Vere-Jones (Ed.),
Proceedings of the Third International Conference on Teaching Statistics. Vol. 1. School and gen-
eral issues (pp. 320–328). Voorburg, The Netherlands: International Statistical Institute.
Green, D. (1993). Data analysis: What research do we need? In L. Pereira-Mendoza (Ed.), Introducing
data analysis in the schools: Who should teach it? (pp. 219–239). Voorburg, The Netherlands:
International Statistical Institute.
Griffin, P. (1990). Profiling literacy development: Monitoring the accumulation of reading skills.
Australian Journal of Education, 34, 290–311.
Hart, W. L. (1953). College algebra (4th ed.). Boston: D. C. Heath.
Inhelder, B., & Piaget, J. (1958). The growth of logical thinking: From childhood to adolescence.
(A. Parsons & S. Milgram, Trans.). New York: Basic Books.
James, G., & James, R. C. (Eds.). (1959). Mathematics dictionary. Princeton, NJ: D. Van Nostrand
Company, Inc.
Jones, G. A. (1974). The performances of first, second and third grade children on five concepts of
probability and the effects of grade, I.Q. and embodiments on their performances. Unpublished
doctoral thesis. Bloomington: Indiana University.
Kahneman, D., & Tversky, A. (1972). Subjective probability: A judgement of representativeness.
Cognitive Psychology, 3, 430–454.
Keeves, J. P., & Alagumalai, S. (1999). New approaches to measurement. In G. N. Masters &
J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 23–42).
Oxford: Pergamon.
Kelly, B. A., & Watson, J. M. (2002). Variation in a chance sampling setting: The lollies task. In
B. Barton, K. C. Irwin, M. Pfannkuch, & M. O. J. Thomas (Eds.), Mathematics education in the
South Pacific (Proceedings of the 25th annual conference of the Mathematics Education Research
Group of Australasia, Auckland, Vol. 2, pp. 366–373). Sydney, NSW: MERGA.
Konold, C., & Pollatsek, A. (2002). Data analysis as the search for signals in noisy processes. Journal
for Research in Mathematics Education, 33, 259–289.
Lee, C. (Ed.). (2003). Reasoning about variability: Proceedings of the Third International Research
Forum on Statistical Reasoning, Thinking, and Literacy [CD-ROM]. Mt. Pleasant, MI: Central
Michigan University.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd
ed.). Thousand Oaks, CA: Sage.
Mokros, J., & Russell, S. J. (1995). Children’s concepts of average and representativeness. Journal for
Research in Mathematics Education, 26, 20–39.
Moore, D. S. (1990). Uncertainty. In L. S. Steen (Ed.), On the shoulders of giants: New approaches to
numeracy (pp. 95–137). Washington, DC: National Academy Press.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for
school mathematics. Reston, VA: Author.
National Council of Teachers of Mathematics. (2000). Principles and standards for school mathemat-
ics. Reston, VA: Author.
Petrosino, A. J., Lehrer, R., & Schauble, L. (2003). Structuring error and experimental variation as
distribution in the fourth grade. Mathematical Thinking and Learning, 5(2&3), 131–156.
Piaget, J., & Inhelder, B. (1975). The origin of the idea of chance in children. (L. Leake Jr., P. Burrell,
& H. D. Fishbein, Trans.). New York: W.W. Norton and Company. (Original work published 1951)
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago:
University of Chicago Press (Original work published 1960)
Reading, C., & Shaughnessy, M. (2000). Student perceptions of variation in a sampling situation. In
T. Nakahara & M. Koyama (Eds.), Proceedings of the 24th annual conference of the International
Group for the Psychology of Mathematics Education (Vol. 4, pp. 89–96). Hiroshima, Japan:
Hiroshima University.
Reading, C., & Shaughnessy, M. (2004). Reasoning about variation. In J. Garfield & D. Ben-Zvi (Eds.),
The challenge of developing statistical literacy, reasoning, and thinking (pp. 201–226). Dordrecht:
Kluwer.
Rubin, A., Bruce, B., & Tenney, Y. (1991). Learning about sampling: Trouble at the core of statistics.
In D. Vere-Jones (Ed.), Proceedings of the Third International Conference on Teaching Statistics:
Vol. 1. School and general issues (pp. 314–319). Voorburg, The Netherlands: International
Statistical Institute.
Shaughnessy, J. M., Canada, D., & Ciancetta, M. (2003). Middle school students’ thinking about vari-
ability in repeated trials: A cross-task comparison. In N. A. Pateman, B. J. Dougherty, & J. T. Zilliox
(Eds.), Proceedings of the 27th conference of the International Group for the Psychology of
Mathematics Education held jointly with the 25th conference of PME-NA (Vol. 4, pp. 159–165).
Honolulu, HI: Center for Research and Development Group, University of Hawaii.
Shaughnessy, J. M., & Ciancetta, M. (2002). Students’ understanding of variability in a probability
environment. In B. Phillips (Ed.), Proceedings of the Sixth International Conference on Teaching
Statistics: Developing a statistically literate society, Cape Town, South Africa [CD-ROM].
Voorburg, The Netherlands: International Statistical Institute.
Shaughnessy, J. M., Watson, J., Moritz, J., & Reading, C. (1999, April). School mathematics students’
acknowledgment of statistical variation. In C. Maher (Chair), There’s more to life than centers.
Presession Research Symposium, 77th Annual National Council of Teachers of Mathematics
Conference, San Francisco, CA.
Skalicky, J. (2005). Assessing multiple objectives with a single task in statistics. In P. Clarkson,
A. Downton, D. Gronn, M. Horne, A. McDonough, R. Pierce, & A. Roche (Eds.), Building connec-
tions: Theory, research and practice (Proceedings of the 28th annual conference of the Mathematics
Education Research Group of Australasia, Melbourne, pp. 688–695). Sydney: MERGA.
Steen, L. A. (1988). The science of patterns. Science, 240, 611–616.
Torok, R., & Watson, J. (2000). Development of the concept of statistical variation: An exploratory
study. Mathematics Education Research Journal, 12, 147–169.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin,
76(2), 105–110.
Watson, J. M. (2001). Longitudinal development of inferential reasoning by school students.
Educational Studies in Mathematics, 47, 337–372.
Watson, J. M. (2002). Inferential reasoning and the influence of cognitive conflict. Educational
Studies in Mathematics, 51, 225–256.
Watson, J. M., & Callingham, R. A. (2003). Statistical literacy: A complex hierarchical construct.
Statistics Education Research Journal, 2(2), 3–46.
Watson, J. M., & Kelly, B. A. (2003a). Inference from a pictograph: Statistical literacy in action. In
L. Bragg, C. Campbell, G. Herbert, & J. Mousley (Eds.), Mathematics education research:
Innovation, networking, opportunity (Proceedings of the 26th annual conference of the Mathematics
Education Research Group of Australasia, Geelong, pp. 720–727). Sydney, NSW: MERGA.
Watson, J. M., & Kelly, B. A. (2003b). Predicting dice outcomes: The dilemma of expectation versus vari-
ation. In L. Bragg, C. Campbell, G. Herbert, & J. Mousley (Eds.), Mathematics education research:
Innovation, networking, opportunity (Proceedings of the 26th annual conference of the Mathematics
Education Research Group of Australasia, Geelong, pp. 728–735). Sydney, NSW: MERGA.
Watson, J. M., & Kelly, B. A. (2003c). The vocabulary of statistical literacy. In Educational research,
risks, & dilemmas: Proceedings of the joint conferences of the New Zealand Association for
Research in Education and the Australian Association for Research in Education [CD-ROM].
Auckland, New Zealand, December, 2003. Available at http://www.aare.edu.au/03pap/alpha.htm
Watson, J. M., & Kelly, B. A. (2004a). Expectation versus variation: Students’ decision making in a
chance environment. Canadian Journal of Science, Mathematics and Technology Education, 4,
371–396.
Watson, J. M., & Kelly, B. A. (2004b). Statistical variation in a chance setting: A two-year study.
Watson, J. M., & Kelly, B. A. (2005). The winds are variable: Students’ intuitions about the weather.
School Science and Mathematics, 105, 252–269.
Watson, J. M., & Kelly, B. A. (2006). Expectation versus variation: Students’ decision making in a sam-
pling environment. Canadian Journal of Science, Mathematics and Technology Education, 6, 145–166.
Watson, J. M., Kelly, B. A., Callingham, R. A., & Shaughnessy, J. M. (2003). The measurement of
school students’ understanding of statistical variation. International Journal of Mathematical
Education in Science and Technology, 34, 1–29.
Watson, J. M., & Moritz, J. B. (1999). The beginning of statistical inference: Comparing two data sets.
Watson, J. M., & Moritz, J. B. (2000). The longitudinal development of understanding of average.
Mathematical Thinking and Learning, 2(1&2), 11–50.
Watson, J. M., & Moritz, J. B. (2003). Fairness of dice: A longitudinal study of students’ beliefs and
strategies for making judgments. Journal for Research in Mathematics Education, 34, 270–304.
Watson, J. M., & Shaughnessy, J. M. (2004). Proportional reasoning: Lessons from research in data
and chance. Mathematics Teaching in the Middle School, 10, 104–109.
Wild, C. J., & Pfannkuch, M. (1999). Statistical thinking in empirical enquiry. International Statistical
Review, 67, 223–265.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA
Press.
Zawojewski, J. S., & Shaughnessy, J. M. (2000). Data and chance. In E. A. Silver & P. A. Kenney
(Eds.), Results from the Seventh Mathematics Assessment of the National Assessment of
Educational Progress (pp. 235–268). Reston, VA: National Council of Teachers of Mathematics.
APPENDIX A: TASKS, CODES, EXAMPLES, INFIT MEAN SQUARE VALUES AND FREQUENCIES
LDN (Lollies Discussion) IMSQ = 0.71
1. Suppose you have a container with 100 lollies in it. 50 are red, 20 are yellow, and 30 are
green. The lollies are all mixed up in the container. You pull out 10 lollies.
(a) How many reds do you expect to get?
(b) Suppose you did this several times. Do you think this many would come out every
time? Why do you think this?
(c) How many reds would surprise you? Why do you think this?
2. Suppose six of you do this experiment.
What do you think is likely to occur for the numbers of red lollies that are written down?
______, ______, ______, ______, ______, ______ Why do you think this?
3. Look at these possibilities that some students have written down for the numbers they
thought likely.
(a) 5,9,7,6,8,7 (b) 3,7,5,8,5,4 (c) 5,5,5,5,5,5 (d) 2,3,4,3,4,4
(e) 7,7,7,7,7,7 (f) 3,0,9,2,8,5 (g) 10,10,10,10,10,10
Which one of these lists do you think best describes what might happen? Why do you think
this?
4. Suppose that 6 students did the experiment. What do you think the numbers will most likely
go from and to?
From __________ (lowest) to __________ (highest) number of reds. Why do you think this?
Now try it for yourself: ______, ______, ______, ______, ______, ______
Given the results, do you want to change any of your previous answers?
Code Description N = 73
4 Distributional reasoning 8
• Strong appreciation of proportion
• Consistency across questions
3 “More” or “half” red with centered reasoning 18
• Intuitive acknowledgment of center
• Partially consistent over questions
• No strong appreciation of proportion
2 “More red” but inconsistent reasoning 25
• No explicit mention of proportion
• An attempt to justify choices – inconsistent over different questions
1 Intuitive iconic reasoning 19
• Favorite numbers
• Guessing
• Location of lollies in container/size of hand
• Outcome approach
0 Limited reasoning 3
• Minimum of 3 “no” responses (e.g., don't know, no reason)
• All other responses iconic
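As a reference point for the expectation and variation that the codes above describe, the chance model behind the Lollies task can be sketched numerically. The sketch below is an illustration added here, not part of the interview protocol: it assumes a handful of 10 is a simple random sample without replacement from the 100 lollies, so the number of reds follows a hypergeometric distribution. The names `prob_reds`, `pmf`, `expected_reds`, `prob_3_to_7`, and `prob_all_red` are hypothetical.

```python
from math import comb

# Container from the Lollies task: 100 lollies, 50 of them red; a handful is 10.
TOTAL, RED, DRAWN = 100, 50, 10


def prob_reds(k: int) -> float:
    """Probability of exactly k reds in one handful drawn without replacement."""
    return comb(RED, k) * comb(TOTAL - RED, DRAWN - k) / comb(TOTAL, DRAWN)


# Full distribution of the number of reds (0 through 10) in a single handful.
pmf = [prob_reds(k) for k in range(DRAWN + 1)]

# Expectation: the long-run average number of reds per handful.
expected_reds = sum(k * p for k, p in enumerate(pmf))

# Variation: how often a handful lands in the central band of 3 to 7 reds,
# and how rare the "10, 10, 10, 10, 10, 10" style outcome is for one handful.
prob_3_to_7 = sum(pmf[3:8])
prob_all_red = pmf[DRAWN]

print(f"expected reds per handful: {expected_reds:.2f}")
print(f"P(3 to 7 reds): {prob_3_to_7:.3f}")
print(f"P(all 10 red): {prob_all_red:.6f}")
```

Under these assumptions the expected count is 5, most handfuls fall within a few lollies of 5, and lists such as 5,5,5,5,5,5 or 10,10,10,10,10,10 in Question 3 understate or ignore the variation the model predicts.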
LGR (Lollies Graphing) IMSQ = 0.86
5. Suppose that 40 students pulled out 10 lollies from the container, wrote down the number of
reds, put them back, mixed them up.
(a) Can you show what the number of reds look like in this case? (Use the blank space below)
(b) Now use the graph below to show what the number of reds might look like for the 40
students [axes provided on next page].
Code Description N = 72
4 An appropriate distribution with a peak around “5”. Reference to the center, 6
acknowledgment of variation, and discussion of the shape of the distribution.
3 Without axes, logical time series graphs that focus on the center. With axes, 11
data focused around “5”. Some reference to the center and variation in
discussion.
2 Mixed performance including: 22
• Without axes, logical time series graphs that do not focus on proportion
but do appreciate variation. With axes, no engagement.
• Without axes, no meaningful representations. With axes, data that are
focused on the center.
• Without axes, meaningful representation in the form of a list of
reasonable numbers, a table or primitive graph. With axes, data that are
focused on the center.
1 Various inappropriate representations including: 23
• Single numbers, pictures, or idiosyncratic diagrams.
• Primitive graphs, tables or lists of possible numbers, but with no focus
on the center.
• Graphs without consideration for the context of the question or for
proportion.
0 Single numbers or pictures. 10
WDN (Weather Discussion) IMSQ = 1.01
1. Some students watched the news every night for a year, and recorded the daily maximum tem-
perature in Hobart. They found that the average maximum temperature in Hobart was 17 °C.
(a) What does this tell us about the temperature in Hobart?
(b) Do you think all the days had a maximum of 17 °C? - Why or why not?
(c) (What do you think the maximum temperature in Hobart might be for 6 different days
in the year?)* ______, ______, ______, ______, ______, ______
(d) Why did you make these choices?
*Part (c) is not part of WDN but essential to understanding Part (d)
Code Description (Watson & Kelly, 2005) N = 73
4 Focus on appropriate variation over the majority of the questions. 8
3 A combination of responses including a focus on appropriate variation 18
combined with single ideas or multiple features and comparisons of
temperature.
2 Focus on multiple aspects with some single features of temperature. 15
1 Focus on single ideas of temperature and/or personal experiences of 28
weather.
0 No engagement with task and no response to two out of the three questions. 4
WDT (Weather Data) IMSQ = 1.25
1. Some students watched the news every night for a year, and recorded the daily maximum tem-
perature in Hobart. They found that the average maximum temperature in Hobart was 17 °C.
c) What do you think the maximum temperature in Hobart might be for 6 different days in
the year?______, ______, ______, ______, ______, ______
e) For the whole year, what do you think the highest and lowest daily maximum tempera-
ture in Hobart would be? highest maximum _____ lowest maximum ____
f) For the month of January, what do you think the highest and lowest daily maximum
temperature in Hobart would be? highest maximum _____ lowest maximum ____
g) For the month of July, what do you think the highest and lowest daily maximum tem-
perature in Hobart would be? highest maximum _____ lowest maximum ____
Code Description (Watson & Kelly, 2005) N = 73
3 Consistent over all four parts. 20
2 Consistent on all parts except one. 23
1 Partially consistent between items in the predictions on all or at least two 27
parts of the protocol.
0 No consistency between items in the predictions or consistent on one 3
prediction only.
WGR (Weather Graphing) IMSQ = 0.85
1. Some students watched the news every night for a year, and recorded the daily maximum
temperature in Hobart. They found that the average maximum temperature in Hobart
was 17 °C.
2. Here are some ideas from other students. What do you think of them?
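[Graphs (a), (b), and (c) presented to students; images not reproduced.]
Code Description (Watson & Kelly, 2005) N = 69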
4 Appropriate interpretation of variation and trend presented in graphs 12
combined with the ability to draw a graph with relevant features
of yearly change.
3 Mixed performance including either: 28
• Appropriate interpretation of variation presented in one of the
presented graphs combined with the ability to draw a graph
with change but not trend.
• Appropriate interpretation of variation and trend presented in graphs
combined with the inability to produce more than an informal
graph or labeled axes.
2 Inconsistent performance in terms of interpretation and graph production. 17
1 Focus on single features of presented graphs and production of informal 9
graphs or graphs that show change but not trend.
0 Misinterpretation of presented graphs combined with the inability to 3
produce more than labeled axes.
CGX/CGV (Comparing Groups - Expectation/Variation)
Two schools are comparing some classes to see which is better at spelling.
a) [Figure: two graphs, each with the axis label “Number of People”; images not reproduced.]
Now look at the scores of all students in each class, and then decide. Did the two classes score
equally well, or did one of the classes score better? Explain how you decided.
b) [Figure: two graphs, each with the axis label “Number of People”; images not reproduced.]
Did the two classes score equally well, or did one of the classes score better? Explain how you
decided.
c) [Figure: two graphs, each with the axis label “Number of People”; images not reproduced.]
Again look at the scores of all students in each class, and then decide. Did the two classes score
equally well, or did one of the classes score better? Explain how you decided.
CGX (Comparing Groups – Expectation) IMSQ = 1.06
Code Description (Skalicky, 2005; Watson & Moritz, 1999) N = 66
5 Second cycle – Relational 1
• All available information from visual comparisons and calculation of
means integrated to support a response in comparing groups of
unequal sample size.
4 Second cycle – Multistructural 4
• Multiple step visual comparisons or numerical calculations (mean)
performed in sequence on a proportional basis to compare groups.
Second cycle – Unistructural
• Single visual comparisons used appropriately in comparing
groups of unequal size.
3 First cycle – Relational 11
• All available information integrated for a complete response
for simple group comparisons.
• Appropriate conclusions restricted to comparing groups of equal size.
2 First cycle – Multistructural 38
• Multiple step visual comparisons or numerical calculations
in sequence on absolute values for simple equal size
group comparisons.
1 First cycle – Unistructural 10
• Single features of the graph used in simple equal size group
comparisons.
0 Prestructural 2
• No focus on specific features.
CGV (Comparing Groups – Variation) IMSQ = 1.39
Code Description (Skalicky, 2005) N = 66
4 Global Focus evident in (b) and/or (c): 3
• Multiple features considered: integrated, compared and contrasted.
3 Multiple Features evident in (b) and/or (c): 24
• More than two columns considered (but only columns).
• Multiple features considered: global or global plus columns,
sequential analysis.
2 Single Features evident in (b) AND (c): 21
• Single column(s) considered: less than or equal to two, or no synthesis.
• “More” with no justification.
1 Single Features evident in (b) OR (c): 11
• Single column(s) considered: less than or equal to two, or no synthesis.
• “More” with no justification.
0 No acknowledgement of variation. 7
SPN (Spinners – Expectation) IMSQ = 0.86
The two fair spinners shown below are part of a carnival game. A player wins a prize only
when both arrows land on black after each spinner is spun once.
Jeff thinks he has a 50–50 chance of winning.
a) Do you agree? (Circle one) Yes No Explain your answer.
b) If he played the game 10 times, how many times would you expect him to win? Why?
c) Now play it 10 times and record your wins and losses.
WIN LOSS
GAME 1
GAME 2
GAME 3
GAME 4
GAME 5
GAME 6
GAME 7
GAME 8
GAME 9
GAME 10
TOTAL
d) How does this compare with what you thought in Part (b)?
Code Description (Watson & Kelly, 2004a) N = 66
4 Relational 6
• Appropriate and theoretical reasoning and understanding of
independent events when predicting outcomes (a and b) and when
explaining outcomes observed from trials (d).
3 Multistructural 5
• Intuitive reasoning of independent events expressed in light of the
experimental outcome when explaining outcomes observed in trial (d).
2 Unistructural 37
• A focus on how the spinner is used when predicting outcomes (a and b)
and when explaining outcomes observed from trial (d).
• A focus on chance (50–50) or “anything can happen” when predicting
(a and b) and when explaining outcomes (d).
1 Iconic 16
• Intuitive beliefs when predicting outcomes (a and b), however,
egocentric or anthropomorphic views when explaining outcomes (d).
0 No engagement with the context. 2
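The theoretical reasoning about independent events referred to at the Relational level can be sketched as a short calculation. This is an added illustration under a stated assumption: the spinner figure is not reproduced here, so the sketch assumes each fair spinner is half black. The names `P_BLACK`, `p_win`, `expected_wins`, and `win_dist` are hypothetical.

```python
from fractions import Fraction
from math import comb

# Assumption: each spinner is half black, so one arrow lands on black
# with probability 1/2. The original figure is not available here.
P_BLACK = Fraction(1, 2)

# The two spins are independent, so the chance that both land on black
# is the product of the two single-spinner probabilities.
p_win = P_BLACK * P_BLACK

# Expected number of wins in the 10 games of Part (b).
GAMES = 10
expected_wins = GAMES * p_win

# Variation around that expectation: binomial distribution of wins in 10 games.
win_dist = {
    k: comb(GAMES, k) * p_win**k * (1 - p_win) ** (GAMES - k)
    for k in range(GAMES + 1)
}

print(f"P(win one game) = {p_win}")
print(f"expected wins in {GAMES} games = {expected_wins}")
print(f"P(exactly 5 wins) = {float(win_dist[5]):.3f}")
```

Under this assumption Jeff's 50–50 claim overstates the chance of winning, and the distribution in `win_dist` shows why the 10 trials in Part (c) need not match the expected value exactly.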
PSX/PSV (Population/Sample Means – Expectation/Variation)
Let’s say that the average weight for Grade 5 children over the whole of Tasmania is 30 kg. A
researcher randomly chooses a sample of 10 Grade 5 children from the state. The first child
chosen weighs 39 kg.
(a) Now think about just the next 9 children in the sample. What do you think their average
weight will be?
Please explain your answer.
(b) Now think about the whole sample of 10 children together. What do you think their
average weight will be?
Please explain your answer.
PSX (Population/Sample Means – Expectation) IMSQ = 1.21

3 $\bar{x}_9 = 30$, $\bar{x}_{10} > 30$
Recognizes relationship and resolves appropriately in terms of 3
sample and population values.
2 $\bar{x}_9 > 30$, $\bar{x}_{10} > \bar{x}_9$; OR $\bar{x}_9 < 30$, $\bar{x}_{10} = 30$
Recognizes the need to have the relationship $\bar{x}_9 < \bar{x}_{10}$ but does not resolve 14
appropriately in terms of the sample/population relationship.
1 $\bar{x}_9 = \bar{x}_{10}$, greater than 30 OR $\bar{x}_9 = \bar{x}_{10}$, equal to 30 OR $\bar{x}_9 < 30$, $\bar{x}_{10} > 30$ 17
Does not recognize the contradictions inherent in the estimations.
0 No idea, or refusal to guess, or value for one sample only. 4
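The arithmetic behind the highest code can be written out as a short worked equation. It is an added illustration, assuming the remaining 9 children are a random sample whose expected mean equals the stated population mean of 30 kg; $\bar{x}_9$ denotes the mean weight of those 9 children and $\bar{x}_{10}$ the mean of all 10.

```latex
% Mean of the whole sample in terms of the known first value and the mean of the other nine
\[
\bar{x}_{10} = \frac{39 + 9\,\bar{x}_9}{10}
\]
% Taking the expected value of \bar{x}_9 to be the population mean of 30 kg
\[
E[\bar{x}_{10}] = \frac{39 + 9 \times 30}{10} = \frac{309}{10} = 30.9 \text{ kg}
\]
```

On this reading, an answer of about 30 kg for Part (a) together with an answer slightly above 30 kg for Part (b) is the resolution that the top code describes.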

PSV (Population/Sample Means – Variation) IMSQ = 0.93
4 Additional perspectives on variation: Further from Level 3, inclusion 2
of aspects related to the sample and population.
3 Several perspectives on observed variation: Listing of various aspects 2
of variation present in the task, perhaps including balancing.
2 Balancing: Recognition of two sets (size 9 and 10) and the need to 3
compensate for the known value in some fashion.
1 Intuitions about variation: Focus on single aspects (perhaps repeated). 28
0 No indication of variation in task. 3
VDF (Defining Variation) IMSQ = 0.96
Definitions [explanations solicited]

(a) Do you know what the word “Variation” means?
(b) Have you heard the word “Variable”? Do you know what it means?
(c) Sometimes I hear on the weather, “the winds are variable”. Do you know what this means?
4 Relational 14
• Appropriate description of “Variation”.
• Appropriate description of “Variable” without the need for a
familiar context (winds).
3 Multistructural 20
• Appropriate description of “Variation”.
• Appropriate description of “Variable” when offered in a
familiar context (winds).
2 Unistructural 7
• Unable to appropriately describe “Variation”.
• Appropriate understanding of “Variable” only when offered in a
familiar context (winds).
1 Prestructural 12
• Unable to appropriately describe “Variation”.
• Inappropriate understanding of “Variable” even when offered in a
familiar context (wind).
0 Cannot offer a response. 5

APPENDIX B: ITEM DIFFICULTIES AND ERRORS OF MEASUREMENT, AND INFIT MEAN SQUARE AND
OUTFIT MEAN SQUARE VALUES FOR EACH CODE OF EACH ITEM
Item  Code 1  Err 1  Code 2  Err 2  Code 3  Err 3  Code 4  Err 4  Code 5  Err 5  Infit MSQ  Outfit MSQ  Infit  Outfit
LEX    −4.13   0.88   −0.86   0.51    0.67   0.49    2.04   0.55     n/a    n/a       0.71        0.73  −2.02    −1.4
LGR    −2.19   0.63   −0.04   0.47    1.15   0.52    2.17   0.62     n/a    n/a       0.86        0.94  −0.82    −0.2
WDN    −3.81   0.81   −0.12   0.47    0.64   0.49    2.02   0.56     n/a    n/a       1.01        1.15   0.11    0.74
WDT    −4.19   0.88   −0.28   0.48    1.08    0.5     n/a    n/a     n/a    n/a       1.25        1.31   1.66    1.33
WGR    −3.69   0.94   −1.43   0.63   −0.12   0.53    1.75   0.52     n/a    n/a       0.85        0.87   −0.9   −0.56
CGX    −3.09   0.97   −1.27   0.65     1.3   0.61    2.27   0.78    3.86   1.47       1.06        1.08   0.36    0.39
CGV    −1.63   0.63   −0.55   0.54    0.62    0.5    3.21   0.86     n/a    n/a       1.39        1.42   2.05    1.72
SPN    −3.25   0.97   −0.75   0.55    1.58   0.61    2.05   0.64     n/a    n/a       0.86        0.79  −0.66   −0.77
PSX    −1.31   0.78    0.72   0.64     2.8   0.93     n/a    n/a     n/a    n/a       1.21        1.24   0.98    0.89
PSV    −1.84   0.88     1.6   0.88    1.92   0.95    2.54   1.13     n/a    n/a       0.93        0.61  −0.08   −0.85
VDF    −1.72   0.69   −0.28   0.51    0.11   0.52    1.38   0.52     n/a    n/a       0.96        1.01  −0.17    0.14
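To show how code difficulties of this kind feed into a partial credit model (Masters, 1982), the sketch below computes the probability of each code for one item at a given student ability. It is an added illustration under loud assumptions: the Code 1 to Code 4 values for LGR are treated as partial credit step parameters in logits, which the table does not state explicitly, and the ability values used are arbitrary. The names `pcm_probabilities`, `most_likely_code`, and `LGR_STEPS` are hypothetical and do not come from the Quest software.

```python
from math import exp

# Assumption: the LGR row's Code 1..Code 4 values are step parameters (logits).
LGR_STEPS = [-2.19, -0.04, 1.15, 2.17]


def pcm_probabilities(theta: float, steps: list[float]) -> list[float]:
    """Partial credit model probabilities for codes 0..len(steps)."""
    # Code 0 has an exponent of 0; code k adds (theta - step_j) for j = 1..k.
    exponents = [0.0]
    for step in steps:
        exponents.append(exponents[-1] + (theta - step))
    weights = [exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def most_likely_code(theta: float, steps: list[float]) -> int:
    """Index of the code with the highest model probability at ability theta."""
    probs = pcm_probabilities(theta, steps)
    return max(range(len(probs)), key=probs.__getitem__)


# Illustrative abilities only; they are not estimates from the study.
for theta in (-4.0, 0.0, 3.0):
    probs = pcm_probabilities(theta, LGR_STEPS)
    rounded = [round(p, 3) for p in probs]
    print(f"theta = {theta:+.1f}: code probabilities {rounded}")

probs_at_zero = pcm_probabilities(0.0, LGR_STEPS)
```

Read this way, higher abilities shift probability toward higher codes, which is the ordering that a kidmap displays for an individual student.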
