Chapter 12
Behavioral Observation
Roger Bakeman and Vicenç Quera
DOI: 10.1037/13619-013
APA Handbook of Research Methods in Psychology: Vol. 1. Foundations, Planning, Measures, and Psychometrics, H. Cooper (Editor-in-Chief)
Copyright © 2012 by the American Psychological Association. All rights reserved.

Preliminaries

Like the 18th-century historian William Douglass who wrote, “As an historian, every thing is in my province” (1760, p. 230), the present-day behavioral scientist could say, “Every thing I know and do begins with observing behavior.” More conventionally, however, behavioral observation is simply and primarily about measurement, which is why you find the present chapter in Measurement Methods, Part III of this volume.

Behavioral Observation Is Measurement

Measurement, as you have learned from other chapters in Part III, is understood as the act of assigning numbers to things: persons, events, time intervals, and so forth. Measurement is inherently quantitative, which means that most chapters in this handbook are potentially relevant to users of observational measurement. For example, studies often are categorized either as correlational or experimental (contrast Sampling Across People and Time with Designs Involving Experimental Manipulations, Volume 2, Parts II and IV, this handbook). True, many experimental studies are performed in laboratories, and behavioral observations often are employed in field settings not involving manipulation. As a result, sometimes nonexperimental studies are referred to as “observational,” as though observational were a synonym for correlational, and are assumed to occur outside laboratories. In fact, correlational studies can be performed in laboratories and experimental ones in the field—and behavioral observations can be employed for either type of study in either setting.

No matter the type of measurement, a key feature of any quantitative investigation is its design. As we plan an investigation and think forward to later data analysis, it is important at the outset to specify two key components: our basic analytic units and our research factors. Research factors usually are described as between-subjects (e.g., gender with two levels, male and female) or within-subjects (e.g., age with repeated observations at 18, 24, and 30 months). Between-subjects analytic units are, for example, the individual participants, parent–child dyads, families, or other groups (often called cases in standard statistical packages, subjects in older literature, or simply basic sampling units), whose scores are organized by our between-subjects research factors. When repeated measures exist, additional analytic units, each identified with a level of a repeated measure, are nested within cases.

Observational Sessions Are Analytic Units

An observational session is a sequence of coded events for which continuity generally can be assumed (although either planned or unplanned breaks might occur during an observational session). For behavioral observation, sessions are equated with analytic units. Statistics and indexes derived from the coded data for an observational session constitute scores; scores from the various subjects and sessions are then organized by any between- and within-subjects factors and are analyzed subsequently
using conventional statistical techniques as guided by the design of the study.

Speaking broadly, designs are of two types: single-subject (as described in Volume 2, Chapter 31, this handbook) and group. As noted in the previous paragraph, the factors of group designs may be between-subjects, within-subjects, or both. For more information about the analysis of group designs, see Volume 3, Chapters 1 to 25, this handbook (especially Chapters 8 to 11); for information about issues relevant for quantitative analysis generally, also see Chapters 32 to 37 of this volume. No matter the design and no matter whether scores are derived from behavioral observation or other measurement techniques, the basic psychometric properties of validity and reliability need to be established (see Chapters 32 and 33 of this volume; reliability issues unique to behavioral observation, particularly observer agreement, are discussed later in this chapter).

Many measurement methods are simple and efficient. What is a person’s weight? Step on a scale. What is a person’s age? Ask him or her. What is a person’s self-esteem? Have the person rate several items on a 1-to-5 scale and then compute the average rating. In contrast, behavioral observation is often time-consuming. Observational sessions can vary from a few minutes to several hours during which human observers need to be present. Better (or worse), sessions can be recorded, which in spite of its advantages can take even more time as observers spend hours coding a few minutes of behavior. The data collected can be voluminous and their analysis seemingly intractable. Why bother with such a time-consuming method?

Reasons for Using Observational Measurement

Several sorts of circumstances can lead an investigator to observational measurement, but three stand out. First, behavioral observation is useful when nonverbal organisms such as human infants, nonhuman primates, or other animals are being studied. We cannot ask them whether they strongly disagree or agree somewhat with a particular statement nor can we ask them to fill out a daily diary saying how much they ate or drank or how much time they spent playing with their infant each day—only observational methods will work. And even when participants are verbal, we may nonetheless use observational methods when studying nonverbal behavior specifically. Not surprisingly, many early examples of behavioral observation are found in studies of animal behavior and infant development.

Second, investigators often choose behavioral observation because they want to assess naturally occurring behavior. The behavior could be observed in field or laboratory settings but presumably is “natural,” reflecting the participant’s proclivities and untutored repertoire and not something elicited, for example, by an experimenter’s task. From this point of view, filling out a questionnaire is “unnatural” behavior; it only occurs in contrived, directed settings and is never spontaneous. Still, you might ask, how natural is observed behavior? Like observer effects in physics, doesn’t behavior change when it is observed? Does awareness of being observed alter our behavior? The answer seems to be that we habituate rapidly to observation, perhaps more so than in earlier years now that security cameras are everywhere. For example, for Bakeman’s dissertation research (Bakeman & Helmreich, 1975), marine scientists living in a space-station-like habitat 50 feet below the surface of Coral Bay in the Virgin Islands were on camera continuously, yet as they went about their work awareness of the cameras seemingly disappeared within the first several minutes of their 2- to 3-week stay in the habitat.

Third, when investigators are interested in process—how things work and not just outcomes—observations capture behavior unfolding in time, which is essential to understanding process. A good example is Gottman’s work on marital interaction (1979), which predicted whether relationships would dissolve on the basis of characterizations of moment-to-moment interaction sequences. One frequently asked process question concerns contingency. For example, when nurses reassure children undergoing a painful procedure, is children’s distress lessened? Or, when children are distressed, do nurses reassure them more? Contingency analyses designed to answer questions like these may be one of the more common and useful applications of observational methods, and are a topic we return to later.
In sum, compared with other measurement methods (e.g., direct physical measurement or deriving a summary score from a set of rated items), observational measurement is often labor-intensive and time-consuming. Nonetheless, observational measurement is often the method of choice when nonverbal organisms are studied (or nonverbal behavior generally); when more natural, spontaneous, real-world behavior is of interest; and when processes and not outcomes are the focus (e.g., questions of contingency). As detailed in subsequent sections of this chapter, behavioral observation requires some unique techniques, but in common with other measurement methods, it produces scores attached to analytic units (i.e., sessions) that are organized by the within- and between-subjects factors of a group design or the within-subjects factors of a single-subject design. Consequently, if this chapter interests you, it is only a beginning; you should find at least some chapters in almost all parts of this three-volume handbook relevant to your interests and worth your attention. Longer treatments are also available: Bakeman and Gottman (1997) provided a thorough overview, Martin and Bateson (2007) emphasized animal and ethological studies, Yoder and Symons (2010) may be especially appealing to those concerned with typical and atypical development of infants and children, and Bakeman and Quera (2011) emphasized data analysis.

Coding Schemes: Measuring Instruments for Observation

Measurement requires a measuring instrument. Such instruments are often physical; clocks, thermometers, and rulers are just a few of many common examples. In contrast, a coding scheme—which consists of a list of codes (i.e., names, labels, or categories) for the behaviors of interest—is primarily conceptual. As rulers are to carpentry—a basic and essential measuring tool—coding schemes are to behavioral observation (although as we discuss, trained human observers are an integral part of the measuring apparatus).

As a conceptual matter, a coding scheme cannot escape its theoretical underpinnings—even if the investigator does not address these explicitly. Rulers say only that length is important. Coding schemes say these specific behaviors and particularly these distinctions are worth capturing; necessarily coding schemes reflect a theory about what is important and why. Bakeman and Gottman (1986, 1997) wrote that using someone else’s coding scheme was like wearing someone else’s underwear. In other words, the coding schemes you use should reflect your theories and not someone else’s—and when you make the connections between your theories and codes explicit, you clarify how the data you collect can provide clear answers to the research questions that motivated your work in the first place.

Where then do coding schemes come from? Many investigators begin with coding schemes that others with similar interests and theories have used and then adapt them to their specific research questions. In any case, developing coding schemes is almost always an iterative process, a matter of successive refinement, and is greatly aided by working with video recordings. Pilot testing may reveal that codes that seemed important simply do not occur, or distinctions that seemed important cannot be reliably made (the solution is to lump the codes), or that the codes seem to miss important distinctions (the solution is to split original codes or define new ones). In its earlier stages especially, developing and refining coding schemes is qualitative research (see Volume 2, Chapters 1 to 13, this handbook).

Mutually Exclusive and Exhaustive Sets of Codes

In addition to content (e.g., the fit with your research questions), the structure of coding schemes can also contribute to their usefulness. Consider the three simple examples given in Figure 12.1. The first categorizes activity on the basis of Bakeman’s dissertation research (Bakeman & Helmreich, 1975; as noted earlier, studying marine scientists living in a space-station-like habitat underwater); it is typical of coding schemes that seek to describe how individuals spend their day (time-budget information). The second categorizes infant states (Wolff, 1966), and the third categorizes children’s play states (Parten, 1932). Each of these coding schemes consists of a set of mutually exclusive and exhaustive (ME&E) codes. This is a desirable and easily achieved property of
coding schemes, one that often helps clarify our codes when under development and that usually simplifies subsequent recording and analysis. Still, when first developing lists of potential codes, we may note codes that logically can and probably will co-occur. This is hardly a problem and, in fact, is desired when research questions concern co-occurrence. Perhaps the best solution is to assign the initial codes on our list to different sets of codes, each of which is ME&E in itself; this has a number of advantages we note shortly.

Activity                   Infant state      Play state
1. doing scientific work   1. quiet alert    1. unoccupied
2. at leisure              2. crying         2. onlooker
3. eating                  3. fussy          3. solitary play
4. habitat-maintenance     4. REM sleep      4. parallel play
5. self-maintenance        5. deep sleep     5. associative play
6. asleep                                    6. cooperative play

Figure 12.1. Three examples of coding schemes; each consists of a set of mutually exclusive and exhaustive codes.

Of course, the codes within any single set can be made mutually exclusive by defining combinations, and any set of codes can be made exhaustive simply by adding a final code: none of the above. For example, if a set consisted of two codes, infant gazes at mother and mother gazes at infant, adding a third combination code, mutual gaze, would result in a mutually exclusive set, and adding a fourth nil code would make the set exhaustive. Alternatively, two sets each with two codes could be defined: mother gazes at infant or not, and infant gazes at mother or not. In this case, mutual gaze, instead of being an explicit code, could be determined later analytically. Which is preferable, two sets with two codes each or one set with four codes—or more generally, more sets with few if any combination codes, or fewer sets but some combination codes? This is primarily a matter of taste or personal preference—similar information can be derived from the data in either case—but especially when working with video records, there may be advantages to more versus fewer sets. Coders can make several passes, attending just to the codes in one set on each pass (e.g., first mother then infant), which allows them to focus on just one aspect of behavior at a time. Moreover, different coders can be assigned different sets, which gives greater credibility to any patterns we detect later between codes in different sets.

We do not want to minimize the effort and hard work usually required to develop effective coding schemes—many hours of looking, thinking, defining, arguing, modifying, and refining can be involved—but if the result is well-structured (i.e., consists of several sets of ME&E codes each of which characterizes a coherent dimension of interest), then subsequent recording, representing, and analysis of the observational data are almost always greatly facilitated.

To every rule, there is an exception. Imagine that we list five codes of interest, any of which might co-occur. Should we define five sets each with two codes: the behavior of interest and its absence? Or should we simply list the five codes and ask observers to note the onset and offset times for each (assuming duration is wanted)? Either strategy offers the same analytic options, and thus it is a matter of taste. As with the fewer versus more combination codes question in the previous paragraph, a good rule is, whatever your observers find easiest to work with (and are reliable doing) is right.

Codes, which after all are just convenient labels, do not stand alone. The coding manual—which gives definitions for each code along with examples—is an important part of any observational research project and deserves careful attention. It will be drafted as coding schemes are being defined and thereafter stands as a reference when training coders; moreover, it documents your procedure and can be shared with other researchers.

Granularity: Micro to Macro Codes

One dimension worth considering when developing coding schemes is granularity. Codes can vary from micro to macro (or molecular to molar)—from detailed and fine-grained to relatively broad and coarse-grained. As always, the appropriate level of granularity is one that articulates well with your research concerns. For example, if you are more interested in moment-to-moment changes in expressed emotion than in global emotional state, you might opt to use the fine-grained facial action coding scheme developed by Paul
Ekman (Ekman & Friesen, 1978), which relates different facial movements to their underlying muscles. A useful guideline is, if in doubt, define codes at a somewhat finer level of granularity than your research questions require (i.e., when in doubt split, don’t lump). You can always analytically lump later but, to state the obvious, you cannot recover distinctions never made.

Concreteness: Physically to Socially Based Codes

Another dimension, not the same as granularity, is concreteness. Bakeman and Gottman (1986, 1997) suggested that coding schemes could be placed on an ordered continuum with one end anchored by physically based schemes and the other by socially based ones. More physically based codes reflect attributes that are easily seen, whereas more socially based codes rely on abstractions and require some inference (our apologies if any professional metaphysicians find this too simple). An example of a physically based code might be infant crying, whereas an example of a more socially based code might be a child engaged in cooperative play. Some ethologists and behaviorists might regard the former as objective and the latter subjective (and so less scientific), but—again eliding matters that concern professional philosophers—we would say that the physically–socially based distinction may matter most when selecting and training observers. Do we regard them as detectors of things really there? Or more as cultural informants, able through experience to “see” the distinctions embodied in our coding schemes? To our mind, a more important question about coding schemes is whether we can train observers to be reliable, a matter to which we return later.

Examples often clarify. Figure 12.2 presents two additional coding schemes. The first categorizes the vocalizations of very young infants (simplified and adapted from Gros-Louis, West, Goldstein, & King, 2006; see Oller, 2000). It is a good example of a physically based coding scheme, so much so that it is possible to automate its coding using sound spectrograph information. Computer coding—dispensing with human observers—has tantalized investigators for some time, but remains mainly out of reach. True, computer scientists are attempting to automate the process, and some limited success has been achieved with automatic computer detection of Ekman-like facial action patterns (Cohn & Kanade, 2007), but the more socially based codes become, the more elusive any kind of computer automation seems. Consider the second coding scheme for maternal vocalizations (also adapted from Gros-Louis et al., 2006). It is difficult to see how this could be automated. For the foreseeable future at least, a human coder—a perceiver—likely will remain an essential part of behavioral observation.

Infant vocalization                                  Maternal response
1. vowels                                            1. naming
2. syllables (i.e., consonant–vowel transitions)     2. questions
3. babbling (a sequence of repeated syllables)       3. acknowledgments
4. other (e.g., cry, laugh, vegetative sounds)       4. imitations
                                                     5. attributions
                                                     6. directives
                                                     7. play vocalizations

Figure 12.2. Two additional examples of coding schemes; the first is more physically based and the second more socially based.

Finally, consider the well-known coding scheme defined by S. S. Stevens (1946), which we reference subsequently. The scheme categorizes measurement scales as (a) nominal or categorical—requires no more than assigning names to entities of interest where the names have no natural order, (b) ordinal—requires ordering or ranking those entities, (c) interval—involves assigning numbers such that an additional number at any point on the scale involves the same amount of whatever is measured, and (d) ratio—distinguishing between interval scales for which zero is arbitrary like degrees Celsius and for which zero indicates truly none of the quantity measured like kilograms. This last distinction is less consequential statistically than the first three.

In sum, for behavioral observation to succeed, the investigator’s toolbox should include well-designed, conceptually coherent, and pilot-tested coding schemes—these are the primary measuring instruments for behavioral observation. Often, but not always, each scheme reflects a dimension of interest and consists of a set of mutually exclusive and exhaustive codes.
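
To make the structural point about ME&E sets concrete, the Python sketch below records the gaze example from this section as two separate ME&E sets (mother gazes at infant or not, infant gazes at mother or not) and then derives mutual gaze analytically, after the fact. It is a minimal sketch under stated assumptions: the code labels, the function names `check_me_and_e` and `combine_sets`, the one-code-per-time-unit representation, and the data are all hypothetical and chosen only for illustration.

```python
# Hypothetical labels for two ME&E sets, one per partner.
GAZE_CODES = {"gaze", "no_gaze"}

# Labels for the equivalent single set of four codes (two single codes,
# one combination code, and one nil code).
COMBINED = {
    ("gaze", "gaze"): "mutual_gaze",
    ("gaze", "no_gaze"): "mother_only",
    ("no_gaze", "gaze"): "infant_only",
    ("no_gaze", "no_gaze"): "neither",
}

def check_me_and_e(stream, codes):
    """Verify that every time unit carries one code drawn from `codes`.

    Storing a single code per unit makes the stream mutually exclusive by
    construction; requiring that code to come from `codes` makes it exhaustive.
    """
    for position, code in enumerate(stream):
        if code not in codes:
            raise ValueError(f"unit {position}: {code!r} is not in the code set")

def combine_sets(mother, infant):
    """Cross two parallel ME&E streams into one stream of combination codes."""
    if len(mother) != len(infant):
        raise ValueError("streams must cover the same time units")
    check_me_and_e(mother, GAZE_CODES)
    check_me_and_e(infant, GAZE_CODES)
    return [COMBINED[(m, i)] for m, i in zip(mother, infant)]

# Invented data: one code per time unit for each partner, coded in separate passes.
mother = ["gaze", "gaze", "no_gaze", "gaze", "no_gaze", "gaze"]
infant = ["no_gaze", "gaze", "gaze", "gaze", "no_gaze", "no_gaze"]

combined = combine_sets(mother, infant)
print(combined)
print(combined.count("mutual_gaze") / len(combined))  # proportion of units in mutual gaze
```

Either representation carries the same information; the two-set version simply postpones the combination step until analysis.
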
Recording Coded Data: From Pencil and Paper to Digital

In the previous section, we argued that coding schemes are a conceptual matter, necessarily reflect your theoretical orientation, and work best when they mesh well with your research questions. In contrast, recording the data that result from applying those coding schemes (i.e., initial data collection) is a practical matter requiring physical materials ranging from simple pencil and paper to sophisticated electronic systems.

Live Observation Versus Recorded Behavior

Perhaps the first question to ask is, are coders observing live or are they working with recordings (video–audio or just audio, on tape or in digital files)? Whenever feasible, we think recordings are preferable. First and most important, recorded material can be played and replayed—literally re-viewed—and at various speeds, which greatly facilitates the observer’s task. Second, only with recorded materials can we ask our observers to code different aspects in different passes, for example, coding a mother’s behavior in one pass and her infant’s in another. Third, recorded materials facilitate checks on observers’ reliability, both between observers who can be kept blind to the reliability check, and within observers when asked to code the same session later. Fourth, contemporary computer systems for capturing coded data work best with recorded material (especially digital files). Finally, video–audio recording devices are relatively inexpensive; cost is not the factor it was in past decades.

Nonetheless, live observation may still be preferred in certain circumstances. In some settings (e.g., school classrooms), video–audio recording devices may be regarded as too intrusive, or for ethical or political reasons, permanent recordings may be unwelcome. And in some circumstances (e.g., observing animal behavior in the field) trained human observers may be able to detect behaviors that are unclear on recordings. Moreover, live observation is simpler; there is no need to purchase, learn about, or maintain video–audio recording devices.

Events and Intervals Are Primary Recording Units

Earlier we defined measurement as assigning numbers to things. Given Stevens’s (1946) coding scheme for scales, we can now see that “numbers” should be expanded to include categories (i.e., codes) and ranks (i.e., numbers representing ordinal position). Thus, behavioral observation almost always begins with categorical measurement: Codes are assigned to things (although ordinal measurement using ratings is another but less frequently used possibility). But to what “things” are codes assigned? The answer is this: events (which may vary in duration) or intervals (whose duration is fixed). These are the two primary recording units used for observational data.

Corresponding to the two primary recording units are two primary strategies for recording coded data: continuous and interval recording. As noted, the primary analytic unit for behavioral observation is a session. Continuous recording implies continuously alert observers, ready to code events when they do occur. In contrast, interval recording (also referred to as time-sampling; Altmann, 1974) requires that the session be segmented into fixed-length intervals and that observers assign a code or codes to each successive interval. The length of the interval may vary from study to study but relatively brief intervals are fairly common (e.g., 10–15 seconds). In subsequent paragraphs, we discuss interval and continuous recording in greater detail and note advantages and disadvantages of each.

Interval Recording

Arguably, interval recording is a limited technique, more used in the past than currently. Its merits are primarily practical: It can be easy and inexpensive to implement, but as a trade-off, data derived from interval recording may be less precise than data derived from other methods. Interval recording lends itself to pencil and paper. All that is needed is a timing device so that interval boundaries can be identified, a lined tablet with columns added (rows represent successive intervals and columns are labeled with codes), and a recording rule. A common recording rule is to check the interval if a behavior occurs once or more within it; this is called
partial-interval or zero-one sampling. Another possibility is momentary sampling—check only if the behavior is occurring at a defined instant, such as the beginning of the interval (although in practice this often is interpreted as check the behavior that predominated during the interval). Another less used possibility is whole-interval sampling—check only if the behavior occurs throughout the interval (see Altmann, 1974; Suen & Ary, 1989). An example using infant state codes is shown in Figure 12.3; because each line is checked for one and only one of these ME&E codes, we can assume that momentary sampling was used.

Figure 12.3. An example of interval recorded data for infant state. [Recording form with one row per interval (1, 2, …, 10, …) and one column per infant code (alert, cry, fussy, REM, sleep); exactly one code is checked in each interval.]

As noted, with interval recording, summary statistics may be estimated only approximately. For example, with zero-one sampling, frequencies likely are underestimated (a check can indicate more than one occurrence in an interval), proportions are likely overestimated (a check does not mean the event occupied the entire interval), and sequences can be muddled (if more than one code is checked for an interval, which occurred first?)—and momentary and whole-interval sampling have other problems. There are possible fixes to these problems, but none seem completely satisfactory. As a result, unless approximate estimates are sufficient to answer your research questions and the practical advantages seem decisive, we usually recommend event and not interval recording whenever feasible.

Nonetheless, we recognize that interval recording has its partisans. Certainly interval recording seemed a good choice for Mel Konner studying mother–infant interaction among the !Kung in Botswana in the late 1960s and early 1970s (Bakeman, Adamson, Konner, & Barr, 1990; Konner, 1976). An electronic device delivered a click every 15 seconds to the observer’s ear. Observers then noted which of several mother, infant, adult, and child behaviors had occurred since the last click. The remote location and the need for live observation in this era before inexpensive and reliable video made interval recording the method of choice.

A variant of interval recording could be termed interrupted interval recording, which is sometimes used in education and other settings. Coders observe for a fixed interval (e.g., 20 seconds) and then record for another fixed interval (e.g., 10 seconds). Such data are even less suitable for any sort of sequential analysis than ordinary interval recording, but as with standard interval recording, when approximate estimates are sufficient, simplicity of implementation may argue for even interrupted interval recording.

Continuous Untimed Event Recording

Imagine that we ask observers simply to note whenever an event of interest occurs and record its code. What could be simpler? Like interval recording, simple event recording is limited but nonetheless sufficient to answer some research questions. The sequence of events is preserved but no information concerning their duration is recorded. Thus, we can report how often different types of events occurred and in what sequence, but we cannot report the average time different types of events lasted or how much of the session was devoted to different types of events.

Again, like interval recording, untimed event recording lends itself to pencil and paper. Using a lined paper tablet, information identifying the session can be written at the top and then codes for each event noted on successive lines. Two refinements are possible. First, even though event durations remain unrecorded, the start and stop times for the session can be recorded. Then rates for the various events can be computed for each session and
will be comparable across sessions that vary in length. Second, each event can be coded on more than one dimension using more than one set of ME&E codes, in effect cross-classifying the event and producing data appropriate for multidimensional contingency tables. Such multievent data are formally identical with interval recorded data and could be collected using forms similar to the one shown in Figure 12.3, adding columns for additional ME&E sets. The only difference is, lines are associated with successive intervals for interval recording and with successive events for multievent (but untimed) recording.

Recording simply the sequence of events, or sequences of cross-classified events, but ignoring the duration of those events, limits the information that can be derived from the coded data. If your research questions require nothing more than information about frequency (or rate), sequence, and possibly cross-classification, then the simplicity and low cost of untimed event recording could cause you to adopt this approach.

Continuous Timed-Event Recording

More useful and less limited data—data that offer more analytic options—result when not just events but their durations are recorded (i.e., their onset and offset times). In general, this is the approach we recommend. Of course, there is a price. Recording event onset and offset times is inevitably more complicated than either interval or untimed event recording. The good news is, advances in technology in the past few decades have made timed-event recording simpler and more affordable than in the past. Continuous timed-event recording does not absolutely require computer technology, but nonetheless works best with it.

Let us begin by describing what is possible with current computer technology. Users can play one or more (synchronized) digital video–audio files using on-screen controls. The image (or images) can be paused and then played forward or backward at various speeds, displaying the current time (rounded to a fraction of a second or accurate to the video frame—there are 29.97 per second per the National Television System Committee [NTSC] standard used in North America, much of South America, and Japan and 25 per second per the Phase Alternating Line [PAL] standard used in most of Europe, the Near East, South Asia, and China). Typically, codes, their characteristics, and the recording method are defined initially. Then when a key is pressed on subsequent playback (or a code displayed on-screen is selected, i.e., clicked), a record of that code and the current time is shown on-screen and stored in an internal data file. With such systems, the human observers do not need to worry about clerical details or keep track of time; the computer system attends to these tasks. If you make a mistake and want to add or delete a code or change a time, typically edits can be accomplished on-screen with minimal effort. The result is a file containing codes along with their onset and (optionally) offset times. Programs vary in their conventions and capabilities, but when sets of ME&E codes are defined, most systems automatically supply the offset time for a code when the onset time of another code in that set is selected; and when some codes are defined as momentary, meaning that only their frequency and not their duration is of concern, offset times are not required.

Some systems permit what we call post hoc coding—first you detect an event and only afterward code it. For example, when you think an event is beginning, hold down the space bar, and when it ends, release, which pauses play. You can then decide what the code should be, enter it, and restart play. Another advanced feature allows subsequent choices for an event to be determined by prior ones (one term for this is lexical chaining). For example, if you select mother for an event, a list of mother behavior codes would be displayed (e.g., talk, rock), whereas if you had selected infant, a list of infant codes would be displayed (e.g., cry, sleep). The next list thereafter, if any, could be determined by the particular mother or infant behavior selected, and so on, until the end of the lexical chain.

Although capabilities vary, computer systems of the sort just described free coders to concentrate on making judgments; clerical tasks are handled automatically and thus the possibility of clerical error is greatly reduced. Such systems can work with live observation or videotapes, but they are at their best with digital files. With digital files, you can jump instantly to any point in the file, whereas with
videotapes, you would wait while the tape winds; you can also ask that a particular
episode be replayed repeatedly, and you can assemble lists of particular episodes that
then can be played sequentially, ignoring other material. Such capabilities are useful for
coding and training and for education purposes generally. (Two examples of such systems
are Mangold International’s INTERACT [see http://www.mangold-international.com] and Noldus
Information Technology’s The Observer [see http://www.noldus.com]; an Internet search will
quickly reveal others.)

Still, there is no need for investigators who require continuous timed-event recording to
despair when budgets are limited. Digital files can be played with standard and free
software on standard computers, or videotapes can be played on the usual tape playback
devices. Codes and times then can be manually entered into, for example, a spreadsheet
program running on its own computer or simply written on a paper tablet. Times can even be
written when coding live; only pencil, paper, and a clock are needed. Such low-tech
approaches can be tedious and error-prone, but they are affordable. When used well, they
can produce timed-event data that are indistinguishable from those collected with systems
costing far more. Still, as our rich city cousin might say, it won’t be as much fun.

Pencil-and-Paper Versus Electronic Methods

To summarize, once coding schemes are defined, refined, and piloted, and once observers
are trained, you are ready to begin recording data. Derivation of summary measures and
other data reduction comes later, but initial data collection (i.e., observational
measurement) consists of observers assigning codes to either fixed time intervals or
events. When codes are assigned to events, the events may be untimed, or onset and offset
times may be recorded (or in some cases inferred). No matter whether the behavior is
observed live or first recorded, any of these strategies (interval, untimed, or
timed-event recording) could be used with anything from pencil, paper, and perhaps a
timing device to a high-end, bells-and-whistles computerized coding system.
Pencil-and-paper methods have their advantages. As noted in Bakeman and Gottman (1986,
1997), pencil and paper feel good in the hand, possess a satisfying physicality, rarely
malfunction, and do not need batteries.

Interval and untimed event recording produce more limited, less rich, less precise data
(data with fewer analytic options) than timed-event recording. At the same time, they can
work satisfactorily with simple and inexpensive equipment, including pencil and paper,
whether observing recorded material or live. In contrast, timed-event recording works best
when video–audio recordings are used along with some electronic assistance. It is the
usual trade-off: richer data, more analytic options, less tedious coding, fewer clerical
errors, more tasks automated, as well as greater expense, longer learning times, and more
resources devoted to maintenance. As always, the right recording system is the one that
matches resources with needs, and when simpler, less precise data are sufficient to answer
key research questions, simple and inexpensive may be best.

Representing Observational Data: The Code-Unit Grid

Too often investigators take their data as collected and move directly to analysis,
bypassing what can be an important step. This intervening step involves representing
(literally, re-presenting) your data, by which we mean transforming the data-as-collected
into a form more useful for subsequent analysis. As described in the previous section,
when recording observational data initially, observer ease and accuracy are of primary
importance. Therefore, it makes sense to design data collection procedures that work well
for our observers; but analysis can be facilitated by how those data are represented
subsequently, especially for timed-event recording.

When both preparing data for subsequent analysis and thinking about what those analyses
should be, we have found it extremely helpful to organize observational data in one common
underlying format (Bakeman, 2010). That underlying format is a grid, which is an ancient
and useful organizing device. For observational data, rows represent codes and columns
represent units (which are either intervals for interval recorded data, or events for
untimed event recorded data, or time units for timed-event
recorded data). Thus, for interval and untimed event data, recording and representational
units are the same, whereas for timed-event data, recording and representational units
differ: events for recording and time units for representing.

The time units for timed-event data are defined by the precision with which time was
recorded. If seconds, each column of the grid represents a second; if tenths of a second,
each column represents a tenth of a second; and so forth (see Figure 12.4). Computer
programs may display multiple digits after the decimal point but, unless specialized
equipment is used, claiming hundredth-of-a-second accuracy is dubious. Video recording is
limited by the number of frames per second (a frame is 0.033 seconds for NTSC and 0.040
seconds for PAL), which allows claims of tenth of a second or somewhat greater but not
hundredth-of-a-second accuracy. Moreover, for most behavioral research questions, accuracy
to the nearest second is almost always sufficient.

[Figure 12.4: code-unit grid with one row for each infant code (alert, cry, fussy, REM,
sleep) and one column for each second, 1 through 10.]
Figure 12.4. An example of a code-unit grid for timed-event recorded data with one-second
precision (i.e., times recorded to nearest second).

Understanding that investigators use different recording methods, as detailed in the
previous section, yet also recognizing the advantages of representational standards, some
years ago we defined conventions for observational data, which we called the Sequential
Data Interchange Standard (SDIS) format (Bakeman & Quera, 1992). We defined five basic
data types. As you might guess from the previous section, the two simplest were event
sequential data and interval sequential data, which result from simple event and interval
recording, respectively. Multievent sequential data result when events are
cross-classified, as described earlier, and timed-event sequential data result from
timed-event recording. A fifth type, state sequential data, is simply a variant of
timed-event sequential data for which data entry is simplified if all codes can be
assigned to ME&E sets. For simple examples, see Bakeman, Deckner, and Quera (2005) and
Bakeman and Quera (2011).

Once data are formatted per SDIS conventions, they can be analyzed with any
general-purpose computer program that uses this standard, such as the Generalized
Sequential Querier (GSEQ; Bakeman & Quera, 1995, 2009, 2011), a program we designed, not
for initial data collection, but specifically for data analysis. Much of the power and
usefulness of this program depends on representing observational data in terms of a
universal code–unit grid, as just described. Three advantages are noteworthy. First,
representing observational data as a grid in which rows represent codes and columns
represent successive events, intervals, or time units makes the application of standard
frequency or contingency table statistics easy (the column is the tallying unit). Second,
the grid representation makes data modification easy and easy to understand. New codes
(i.e., rows in the grid) can be defined and formed from existing codes using standard
logical operations (Bakeman et al., 2005; Bakeman & Quera, 1995, 2011). Third, the
discrete time-unit view (i.e., segmenting time into successive discrete time units defined
by precision) of timed-event sequential data solves some, but not all, problems in gauging
observer agreement, which is the topic of the next section.

Observer Agreement: Event Based, Time Based, or Both?

Observer agreement is often regarded as the sine qua non of observational measurement.
Without it, we are left with individual narratives of the sort used in qualitative
research (see Volume 2, Chapters 1 to 13, this handbook). Even so, a suitable level of
agreement between two independent observers does not guarantee accuracy (two observers
could share similar deviant views of the world), but it is widely regarded as an index of
acceptable measurement. If test probes reveal that the records of two observers recorded
independently do not agree (or an observer does not agree with a presumably accurate
standard),
the accuracy of any scores derived from data coded by those observers is uncertain:
Further observer training, modification of the coding scheme, or both are needed. On the
other hand, when test probes reveal that observers’ records substantially agree (or an
observer agrees with a presumably accurate standard), we infer that our observers are
adequately trained and regard the data they produce as trustworthy and reliable.

Classic Cohen’s Kappa and Interval Recorded Data

Probably the most commonly used statistic of observer agreement is Cohen’s kappa (1960),
although as we explain shortly it is most suited for interval recorded data. The classic
Cohen’s kappa characterizes agreement with respect to a set of ME&E codes while correcting
for chance agreement. It assumes that things (demarcated units) are presented to a pair of
observers, each of whom independently assigns a code to each unit. Each pair of observer
decisions is tallied in a K × K table (also called an agreement or confusion matrix) where
K is the number of codes in the ME&E set. For example, if 100 intervals were coded using
the five infant state codes defined earlier, the agreement matrix might look like the one
shown in Figure 12.5. In this case, the two observers generally agreed (i.e., most tallies
were on the diagonal); the most frequent confusion (when Observer 1 coded alert but
Observer 2 coded fussy) occurred just three times. Kappa is computed by dividing
chance-corrected observed agreement (i.e., the probability of observed agreement minus the
probability of agreement expected by chance) by the maximum agreement not due to chance
(i.e., 1 minus the probability of agreement expected by chance; see Bakeman & Gottman,
1997): κ = (P_obs − P_exp)/(1 − P_exp). For this example, P_obs = .82 (82 of the 100
tallies fall on the diagonal) and P_exp = .24 (the sum, over codes, of the products of the
corresponding row and column proportions), so the value of kappa was
(.82 − .24)/(1 − .24) = .76 (Fleiss, 1981, characterized kappas more than .75 as
excellent, .40 to .75 as fair to good, and below .40 as poor, p. 218).

Obs 1’s                   Obs 2’s codes
codes       alert    cry  fussy    REM  sleep  TOTAL
alert          26      2      3      0      0     31
cry             2     27      1      0      0     30
fussy           1      2      4      2      1     10
REM             0      0      1     17      1     19
sleep           0      0      0      2      8     10
TOTAL          29     31      9     21     10    100

Figure 12.5. Agreement matrix for two observers who independently coded 100 fixed-time
intervals using the infant state coding scheme. For this example, Cohen’s kappa = .76.
Obs = observer.

Now here is the problem. Cohen’s kappa assumes that pairs of coders make decisions when
presented with discrete units and that the number of decisions is the same as the number
of units. This decision-making model fits interval-recorded data well but fits
event-recorded data only when events are presented to coders as previously demarcated
units, for example, as turns of talk in a transcript. Usually events are not prepackaged.
Instead, with event recording, observers are asked to first segment the stream of behavior
into events (i.e., detect the seams between events) and then code those segments. Because
of errors of omission and commission (one observer detects events the other misses),
usually the two observers’ records will contain different numbers of events and exactly
how the records align is not always obvious. And when alignment is uncertain, how events
should be paired and tallied in the agreement matrix is unclear.

Aligning Untimed Events When Observers Disagree

Aligning two observers’ sequences of untimed events is problematic. Bakeman and Gottman
(1997) wrote that especially when agreement is not high, alignment is difficult and cannot
be accomplished without subjective judgment. Recently, however, Quera, Bakeman, and Gnisci
(2007) developed an algorithm that determines the optimal global alignment between two
event sequences. The algorithm is adopted from sequence alignment and comparison
techniques that are routinely used by molecular biologists (Needleman & Wunsch, 1970). The
task is to find an optimal alignment. The Needleman–Wunsch algorithm belongs to a broad
class of methods known as dynamic programming, in which the solution for a specific
subproblem can be derived from the solution for another, immediately preceding subproblem.
It can be demonstrated that
the method guarantees an optimal solution, that is, it finds the alignment with the
highest possible number of agreements between sequences (Sankoff & Kruskal, 1983/1999,
p. 48) without being exhaustive, that is, it does not need to explore the almost
astronomical number of all possible alignments (Galisson, 2000).

The way the algorithm works is relatively complex, but a simple example can at least show
what results. Assume that two observers using the infant vocalization scheme described
earlier recorded the two event sequences (S1 and S2) shown in Figure 12.6. The first
observer coded 15 events and the second 14, but because of omission–commission errors, the
optimal alignment shows 16. The 11 agreements are indicated with vertical bars and the two
actual disagreements with two dots (i.e., a colon), but there were three additional
errors: The algorithm estimated that Observer 1 missed one event that Observer 2 coded
(indicated with a hyphen in the top alignment line) and Observer 2 missed two events that
Observer 1 coded (indicated with hyphens in the bottom alignment line). The alignment then
lets us tally paired observer decisions (using nil to indicate a missed event) and compute
kappa, with two qualifications. First, because observers cannot both code nil, the
resulting agreement matrix contains a logical (or structural) zero; as a consequence, the
expected frequencies required by the kappa computation cannot be estimated with the usual
formula for kappa but instead require an iterative proportional fitting (IPF) algorithm
(see Bakeman & Robinson, 1994). Second, because Cohen’s assumptions are not met we should
not call this a Cohen’s kappa; it might better be called alignment kappa instead
(specifically, an event-based dynamic programming alignment kappa). For this example, we
used our GSEQ program to determine the alignment and compute alignment kappa.

Obs 1’s                     Obs 2’s codes
codes        nil  vowel  syllable  babble  other  TOTAL
nil            –      0         1       0      0      1
vowel          1      3         0       0      0      4
syllable       1      0         4       0      0      5
babble         0      0         0       3      0      3
other          0      1         1       0      1      3
TOTAL          2      4         6       3      1     16

Sequences:
S1 = vvsbosbosvosbvs
S2 = vsbssbsvsvobvs

Alignment:
S1  vvsbosb-osvosbvs
     |||:|| :||| |||
S2  -vsbssbsvsvo-bvs

Figure 12.6. Alignment of two event sequences per our dynamic programming algorithm, and
the resulting agreement matrix. For alignment, vertical bars indicate exact agreement, two
dots (colon) disagreements, and hyphens events coded by one observer but not the other.
For this example, alignment kappa = .60. Obs = observer.

Time-Unit and Event-Based Kappas for Timed-Event Data

The alignment algorithm solves the problem for untimed event data but what of timed-event
data? We have proposed two solutions. The first solution, which is the one presented by
Bakeman and Gottman (1997), depends on the discrete view of time reflected in the
code-time-unit grid described earlier. Assuming a discrete view of time and a
code-time-unit grid like the one shown in Figure 12.4, agreement between successive pairs
of time units can be tallied and kappa computed. As a variant, agreement could be tallied
when codes for time units matched if not exactly at least within a stated tolerance (e.g.,
2 seconds). Because time units are tallied, the summary statistic should be called
time-unit kappa, or time-unit kappa with tolerance, to distinguish it from the classic
Cohen’s kappa.

One aspect of time-unit kappa seems troublesome. With the classic Cohen model, the number
of tallies represents the number of decisions coders make, whereas with time-unit kappa,
the number of tallies represents the length of the session (e.g., when time units are
seconds, a 5-minute session generates 300 tallies). With timed-event recording, observers
are continuously looking for the seams between events, but how often they are making
decisions is arguable, probably unknowable. One decision per seam seems too few (the
observers are continuously alert), but one per time unit seems too many. Moreover, the
number of tallies increases with the precision of the time unit
(although multiplying all cells in an agreement matrix by the same factor does not change
the value of kappa; see Bakeman & Gottman, 1997).

Thus the second solution is to align the events in the two observers’ timed-event
sequential data and compute an event-based kappa. Compared with tallying time units,
tallying agreements and disagreements between aligned events probably underestimates the
number of decisions observers actually make, but at least the number of tallies is closer
to the number of events coded. Consequently, we modified our untimed event alignment
algorithm to work with timed-event sequential data and compared this algorithm with ones
available in The Observer and INTERACT (Bakeman, Quera, & Gnisci, 2009). Kappas with the
different event-matching algorithms were not dramatically different; time-based and
event-based kappas varied more.

In sum, we recommend as follows: When assessing observer agreement for interval-recorded
data, use the classic Cohen’s kappa. For untimed event recorded data, use our
event-matching algorithm, which allows for omission–commission errors, and report the
event-based kappa. For timed-event data, compute and report both an event-based and a
time-based kappa (with or without tolerance); their range likely captures the true value
of kappa (both are computed by GSEQ). Moreover, examining individual cells of the kappa
table provides observers with useful feedback. In the case of timed-event data, observers
should examine agreement matrixes for both event-based and time-unit-based kappas. Each
provides somewhat different but valuable information about disagreements, which can be
useful as observers strive to improve their agreement.

Analyzing Observational Data: Simple Statistics

Perhaps more with behavioral observation than other measurement methods, the data
collected initially are not analyzed directly. Intervening steps may be required with
other methods (for example, producing a summary score from the items of a self-esteem
questionnaire), but with observational data, producing summary scores and data reduction
generally are almost always required. In the process, the usual categorical measurements
reflected in the data collected are transformed into scores for which, typically,
interval- or ratio-scale measurement can be assumed. As with scores generally, so too with
summary scores derived from behavioral observation, the first analytic step involves
description, the results of which may limit subsequent analyses (as, e.g., when
inappropriate distributions argue against analyses of variance). But what summary scores
should be derived and described first?

It is useful to distinguish between simple statistics that do not take sequencing or
contingency into account (described in this section) and contingency statistics that do
take contingency into account (described in the section Analyzing Observational Data:
Contingency Indexes). It makes sense to describe simple statistics first because, if their
values are not appropriate, computation of some contingency statistics may be precluded or
at best questionable. Simple statistics based on behavioral observation are relatively few
in number but, as you might expect, their interpretation depends on the data recording and
representation methods used. In the following paragraphs, we describe seven basic
statistics, note how data type affects their interpretation, and recommend which
statistics are most useful for each data type.

Frequency, Rate, and Relative Frequency

1. Frequency indicates how often. For event or timed-event data, it is the number of times
an event occurred (i.e., was coded). For interval or multievent data, it is the number of
bouts coded, that is, the number of times a code was checked without being checked for the
previous interval or multievent; that is, if the same code occurred in successive
intervals or multievents, only one is added to its frequency count. As noted shortly, for
interval and multievent data, duration gives the number of units checked.

2. Rate, which is the frequency per a specified amount of time, likewise indicates how
often. Rate is preferable to frequency when sessions vary in length because it is
comparable across sessions. Rates may be expressed per minute, per hour, or per any other
time unit that makes sense. The session durations required to compute rate can be derived
from the data for timed-event
and interval data, but computing rates for event or multievent data requires that session
start and stop times be recorded explicitly.

3. Relative frequency indicates proportionate use of codes. For all data types it is a
code’s frequency, as just defined, divided by the sum of frequencies for all codes in a
specified set, hence relative frequencies necessarily sum to 1. Alternatively, relative
frequencies can be expressed as percentages summing to 100%. For example, if we only coded
mother vocalization, we might discover that 22% of a mother’s vocalizations were coded
naming. As discussed shortly, depending on your specific research questions, relative
frequency may or may not be a statistic you choose to analyze.

Duration, Probability, and Relative and Mean Bout Duration

4. Duration indicates how long or how many. For timed-event data, duration indicates how
much time during the session a particular code occurred. For simple event data, duration
is the same as frequency. For interval or multievent data, duration indicates the number
of intervals or multievents checked for a particular code, thus duration may be a more
useful summary statistic for these data types than frequency, which as just noted
indicates the number of bouts.

5. Probability indicates likelihood. It can be expressed as either a proportion or a
percentage. For timed-event data, it is duration divided by total session time, leading to
statements like, the baby was asleep for 46% of the session. For simple event data, it is
the same as relative frequency. For interval or multievent data, it is duration divided by
the total number of intervals or multievents, leading to statements like, solitary play
was coded for 18% of the intervals.

6. Relative duration indicates proportionate use of time for timed-event data and of
intervals or multievents for interval and multievent data. For all data types it is a
code’s duration, as just defined, divided by the sum of durations for all codes, thus it
only makes sense when the codes specified form a single ME&E set. As with relative
frequency, relative durations necessarily sum to 1 and can also be expressed as
percentages summing to 100%. For example, when coding mother vocalizations, we might
discover that 37% of the time when mother’s vocalizations were occurring they were coded
naming. As with relative frequency, depending on your specific research questions,
relative duration may or may not be a statistic you choose to analyze.

7. Mean bout duration indicates how long events last, on average, and makes sense
primarily for timed-event data. It is duration divided by frequency, as just defined. When
computed for interval or multievent data, it indicates the mean number of successive
intervals or multievents checked for a particular code.

Recommended Statistics by Data Type

No matter whether your sessions are organized by a group or a single-subject design, we
assume you will compute summary statistics for individual sessions (i.e., analytic units)
and then subject those scores to further analyses. In the next few paragraphs, we discuss
the simple summary statistics we think are most useful for each data type and, reversing
our usual order, begin with timed-event data, which offers the most options.

For timed-event data, we think the most useful summary statistics indicate how often, how
likely, and how long. Rate and probability are comparable across sessions (i.e., control
for differences in session length) and therefore usually are preferable to frequency and
duration. Mean bout duration provides useful description as well, but here you have a
choice. These three statistics are not independent (mean bout duration is duration divided
by frequency), thus present just two of them, or if you describe all three, be aware that
any analyses are not independent. Finally, use relative frequency or relative duration
only if clearly required by your research questions.

With timed-event data a key question is, should you use rate, probability, or both? These
two statistics provide different, independent information about your codes; they may or
may not be correlated. The answer is, it depends on your research question. For example,
do you think that how often
a mother corrects her child is important? Then use rate. Or, do you think that the amount
of time (expressed as a proportion or percentage of the session) a mother corrects her
child (or a child experiences being corrected) is important? Then use probability.
Whichever you use (or both), always provide your readers with an explicit rationale for
your choice; otherwise, they may think your decision was thoughtless.

For other data types, matters are simpler. For simple event data, we think the most useful
summary statistics indicate how often and how likely, that is, frequency (or rate when
sessions vary in length and start and stop times were recorded) and probability. Finally,
for interval and multievent data, we think the most useful summary statistic indicates how
likely, that is, how many intervals or multievents were checked for a particular code.
With these data types, use other statistics only if clearly required by your research
questions.

Analyzing Observational Data: Contingency Indexes

The summary statistics described in the previous section were called simple, but they
could also be called one-dimensional because each statistic is computed for a single code.
In contrast, the summary statistics described in this section could be called
two-dimensional because they combine information about two codes, arranged in
two-dimensional contingency tables. Still, the overall strategy is the same; summary
statistics are computed for individual sessions followed by appropriate statistical
analyses.

Statistics derived from two-dimensional tables are of three kinds. First are statistics
for individual cells; these are primarily descriptive. Second are summary statistics for
2 × 2 tables; these indexes of contingency often turn out to be the most useful
analytically. And third are summary indexes of independence and association for tables of
varying dimensions such as Pearson chi-square and Cohen’s kappa; because these are
well-known or already discussed, we will not discuss them further here but instead focus
on individual cell and 2 × 2 table statistics.

Statistics for Individual Cells

Statistics for the individual cells of a contingency table can be computed for tables of
varying dimension but for illustrative purposes we give examples for the 2 × 2 table shown
in Figure 12.7. The rows represent infant cry, columns represent mother soothe, and the
total number of tallies is 100. This could be 100 events, or 100 intervals, or 100 time
units, depending on data type. Usually for timed-event and often for interval or
multievent data, rows and columns are unlagged, that is, they represent concurrent time
units, intervals, or events (i.e., Lag 0). For simple event data, columns usually are
lagged (because all co-occurrences are zero); thus rows might represent Lag 0 and columns
Lag 1, in which case the number of tallies would be one less than the number of simple
events coded.

                  target
given       soothe  no soothe  TOTAL
cry             13         11     24
no cry          21         55     76
TOTAL           34         66    100

p(cry) = 24/100 = .24
p(soothe) = 34/100 = .34
p(soothe|cry) = 13/24 = .54
Odds ratio = (13/11)/(21/55) = 1.18/0.38 = 3.10
Log odds = 1.13
Yule’s Q = .51

Figure 12.7. Determining the association between infant cry and maternal soothe: An
example of a 2 × 2 table tallying 1-second time units and its associated statistics.

In the following paragraphs, we give definitions for the five most common cell statistics
and provide numeric examples derived from the data in Figure 12.7. In these definitions,
r specifies a row, c a column, f_rc the frequency count for a given cell, f_r+ a row sum,
f_+c a column sum, f_++ the total number of tallies for the table, and p_i the simple
probability for a row or column (e.g., p_r = f_r+ ÷ f_++).

1. The observed joint frequency is f_rc. The joint frequency for cry and soothe is 13.

2. The conditional probability is the probability for the column (or target) behavior
given the row (or given) behavior: p(c|r) = f_rc ÷ f_r+. The conditional
probabilities in a row necessarily sum to 1. The probability of a mother soothe given an infant cry is .54 and of no soothe given infant cry is .46.
3. The expected frequency is the frequency expected by chance given the simple probability for the column behavior and the frequency for the row behavior: $\mathit{exp}_{rc} = p_c \times f_{r+} = (f_{+c} \div f_{++}) \times f_{r+}$. The expected frequency for cry and soothe is 8.16, which is less than the observed value of 13.
4. The raw residual is the difference between observed and expected: $\mathit{res}_{rc} = f_{rc} - \mathit{exp}_{rc}$. The observed joint frequency for cry and soothe exceeds the expected value by 4.84 (13 − 8.16).
5. The adjusted residual is the raw residual divided by its estimated standard error: $z_{rc} = (f_{rc} - \mathit{exp}_{rc}) \div \mathit{SE}_{rc}$, where $\mathit{SE}_{rc} = \sqrt{\mathit{exp}_{rc} \times (1 - p_c) \times (1 - p_r)}$. The standard error is the square root of 8.16 × .66 × .76 = 4.09, which is 2.02; thus $z_{rc}$ = 4.84 ÷ 2.02 = 2.39. If adjusted residuals were distributed normally, we could say that the probability of a result this extreme by chance is less than .05 because 2.39 exceeds 1.96.

Of these statistics, perhaps the adjusted residual is the most useful. Values that are large and positive, or large and negative, indicate co-occurrences (or lagged associations) greater, or less, than expected by chance; a useful guideline is to pay attention to absolute values greater than 3. Of the others, the conditional probability is useful descriptively but not analytically because its values are contaminated by its simple probabilities. For example, if cry occurs frequently, then values of soothe given cry are likely to be higher than if cry were not as frequent. In other words, the more frequently a code occurs, the more likely another code is to co-occur. The adjusted residual is a better candidate for subsequent analyses, but 2 × 2 contingency indexes as described in the next section may be even better.

Contingency Indexes for 2 × 2 Tables
When research questions involve the contingency between two behaviors, one presumed antecedent and the other consequent (i.e., before and after, given and target, or row and column), tables of any dimensions can be reduced to a 2 × 2 table like the one shown in Figure 12.7. In this table, rows are labeled given behavior, yes or no, and columns are labeled target behavior, yes or no. This is advantageous because then the contingency between the presumed given and target behavior can be assessed with standard summary statistics for 2 × 2 tables. In the next few paragraphs, we define the four summary statistics typically reported, although probably only one or two are needed. As is conventional, we label the cells of the 2 × 2 table as follows: $f_{11} = a$, $f_{12} = b$, $f_{21} = c$, $f_{22} = d$. Again, numeric examples are derived from the data in Figure 12.7.

1. The odds ratio is a measure of effect size whose interpretation is straightforward and concrete: OR = (a/b)/(c/d). It is useful descriptively and deserves to be used more by behavioral scientists (it is already widely used by epidemiologists). As the name implies, it is the ratio of two odds, derived from the top and bottom rows of a 2 × 2 table. For example, the odds of soothe to no soothe when crying are 13 to 11 or 1.18 to 1 and when not crying are 21 to 55 or 0.38 to 1, thus OR = 1.18/0.38 = 3.10. Concretely, this means that the odds of the mother soothing her infant are more than three times greater when her infant is crying than when not. The odds ratio varies from 0 to infinity with 1 indicating no effect. Values greater than 1 indicate that the target behavior (in column 1) is more likely in the presence of the given behavior (row 1) than its absence (row 2), whereas values less than 1 indicate that the target behavior (in column 1) is more likely in the absence of the given behavior (row 2) than its presence (row 1). Because the odds ratio varies from 0 to infinity, its distributions often are skewed. Consequently, the odds ratio, although useful descriptively, is not so useful analytically.
2. The log odds is the natural logarithm of the odds ratio: $\mathrm{LnOR} = \log_e \mathrm{OR}$. For example, $\log_e 3.10 = 1.13$ (i.e., $2.718\ldots^{1.13} = 3.10$). It varies from negative to positive infinity with zero indicating no effect, and compared with the odds ratio, its distributions are less likely to be skewed. However, it is expressed in difficult-to-interpret logarithmic
units. As a result, it is useful analytically but not descriptively.
3. Yule’s Q is an index of effect size that is a straightforward algebraic transform of the odds ratio: Q = (ad − bc)/(ad + bc) (Bakeman et al., 2005). It is like the familiar correlation coefficient in two ways: it varies from −1 to +1 with 0 indicating no effect, and its units have no natural meaning. Thus, its interpretation is not as concrete as the odds ratio.
4. The phi coefficient is a Pearson product-moment correlation coefficient computed for binary data. Like Yule’s Q it can vary from −1 to +1, but can only achieve its maximum value when $p_r = p_c = .5$, thus Yule’s Q almost always seems preferable.

Which contingency index should you use, the odds ratio descriptively and the log odds analytically, or Yule’s Q for both? It is probably a matter of taste. We think the odds ratio is more concretely descriptive, but Yule’s Q may seem more natural to some, especially those schooled in correlation coefficients. Another consideration is computational vulnerability to zero cells. A large positive effect (column 1 behavior more likely given row 1 behavior) occurs as b (or c) tends toward zero and a large negative effect (column 1 behavior less likely given row 1 behavior) occurs as a (or d) tends toward zero. If only one cell is zero, a large negative and a large positive effect are computed as −1 and +1, 0 and infinity, and undefined (log of 0) and undefined (divide by 0), for Yule’s Q, the odds ratio, and the log odds, respectively. Thus Yule’s Q is not vulnerable to zero cells, the odds ratio is vulnerable only if b or c is zero (using the computational formula, ad/bc, for the odds ratio), and the log odds is vulnerable if any cell is zero, which leads many to advocate adding a small constant, typically ½, to each cell before computing a log odds (e.g., Wickens, 1989).

One circumstance is always fatal. If two or more cells are zero (which means that one or more row or column sums are zero), no contingency index can be computed and subsequent analyses would treat its value as missing. After all, if one of the behaviors does not occur, no contingency can be observed. Even when row or column sums are not zero but simply small, it may be wise to treat the value as missing. With few observations, there is little reason to have confidence in its value even when computation is technically possible. Our guideline is, if any row or column sum is less than 5, regard the value of the contingency index as missing, but some investigators may prefer a more stringent guideline.

Lag Sequential Analysis for Simple Event Data
Given simple event data and lagged contingency tables, either the adjusted residuals or the contingency indexes just described could be used for a lag sequential analysis. For example, if Figure 12.7 represented Lag 1 event data (given labeled Lag 0, target Lag 1, and tallying events and not time units), we could say that the probability of a soothe event following a cry event was .54, which is greater than the simple probability of soothe (.34). Moreover, the adjusted residual was 2.39 and Yule’s Q was .51, both positive. (For a more detailed description of event-based lag sequential analysis see Bakeman & Gottman, 1997, pp. 111–116.)

Time-Window Sequential Analysis for Timed-Event Data
Given timed-event data, traditional lag sequential analysis (using time units to indicate lags) does not work very well. Time-window sequential analysis (Bakeman, 2004; Bakeman et al., 2005; Yoder & Tapp, 2004) works better, allows more flexibility, and, incidentally, demonstrates the usefulness of the contingency indexes just described (e.g., see Chorney, Garcia, Berlin, Bakeman, & Kain, 2010). The generic question is whether the target behavior is contingent on the given behavior. First, we define a window of opportunity or time window for the given behavior. For example, we might say that for a behavior to be contingent we need to see a response within 3 seconds; thus, we would code the onset second of the given behavior and the following 2 seconds as a given window (assuming 1-second precision). Second, we code any second in which the target behavior starts as a target onset. Third, we tally time units for the session into a 2 × 2 table, and fourth we compute a contingency index for the table (this can all be done with GSEQ).
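The four steps of a time-window analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how GSEQ does it: the function names, the `window` and `min_margin` parameters, and the representation of a session as lists of integer onset seconds (counted from 0) are ours. The `min_margin` default follows the guideline given earlier of treating an index as missing when any row or column sum is less than 5.

```python
import math

def tally_2x2(given_onsets, target_onsets, n_seconds, window=3):
    """Tally 1-second time units into cells (a, b, c, d).

    Rows: second inside / outside a given window.
    Columns: second is / is not a target onset.
    Onsets are assumed to be integers in range(n_seconds).
    """
    # Step 1: the onset second plus the following (window - 1) seconds.
    in_window = [False] * n_seconds
    for t in given_onsets:
        for s in range(t, min(t + window, n_seconds)):
            in_window[s] = True
    # Step 2: seconds in which the target behavior starts.
    targets = set(target_onsets)
    # Step 3: tally every second of the session into one of four cells.
    a = b = c = d = 0
    for s in range(n_seconds):
        hit = s in targets
        if in_window[s]:
            a, b = (a + 1, b) if hit else (a, b + 1)
        else:
            c, d = (c + 1, d) if hit else (c, d + 1)
    return a, b, c, d

def contingency_indexes(a, b, c, d, min_margin=5):
    """Step 4: odds ratio, log odds, and Yule's Q, or None if treated as missing."""
    if min(a + b, c + d, a + c, b + d) < min_margin:
        return None  # too few observations in some row or column
    ad, bc = a * d, b * c
    odds_ratio = ad / bc if bc > 0 else math.inf
    # The log odds is undefined when the odds ratio is 0 or infinite.
    log_odds = math.log(odds_ratio) if 0 < odds_ratio < math.inf else None
    yules_q = (ad - bc) / (ad + bc)
    return {"odds_ratio": odds_ratio, "log_odds": log_odds, "yules_q": yules_q}

# The cell counts of Figure 12.7: a = 13, b = 11, c = 21, d = 55.
print(contingency_indexes(13, 11, 21, 55))
```

Applied to the counts in Figure 12.7, the sketch gives an odds ratio of about 3.10, a log odds of about 1.13, and a Yule’s Q of about .51. Returning `None` rather than a number for sparse tables keeps such sessions out of later analyses instead of letting an unstable value through.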
For example, assume the tallies in Figure 12.7 represent 1-second time units, that soothe refers to the onset of verbal reassurance (it is probably better to imagine a behavior more quick and frequent than soothe for this example), and that cry refers to a cry window (e.g., within 3 seconds of a cry onset). Thus in 100 seconds there were 34 reassure episodes (onsets or bouts) and probably 8 episodes of infant cry (assuming the 24 seconds total divide into eight 3-second windows). For this example, reassure and cry appear associated. The likelihood that reassure would begin within 3 seconds of a cry starting was 3 times greater than at other times (and Yule’s Q was .51). Descriptively, 38% (13 of 34) of reassure episodes began during cry windows although the windows accounted for 24% of the time. It only remains to compute such indexes for other sessions and use those scores in whatever analyses make sense given your design.

Conclusion

Behavioral observation is one of several measurement approaches available to investigators engaged in quantitative behavioral research. It is often the method of choice when nonverbal organisms are studied (or nonverbal behavior generally); when more natural, spontaneous, real-world behavior is of interest; and when processes and not outcomes are the focus (e.g., questions of contingency). Compared with other approaches, it is often labor-intensive and time-consuming. Coding schemes, the basic measuring instrument of behavioral observation, need to be developed and observers trained in their reliable use, and the often-voluminous data initially collected need to be reduced to simple rates and probabilities or contingency indexes for later analyses. Behavior can be observed live or recorded for later viewing (and re-viewing). Observers either assign codes to predetermined time intervals (interval recording) or detect and code events in the stream of behavior (event recording), using instruments that vary from simple pencil and paper to sophisticated computer systems. Coded data can be represented in a code-unit grid as interval, untimed event or multievent, or timed-event data; the latter offers the most options but works best with electronic equipment. Behavioral observation can be used for experimental or nonexperimental studies, in laboratory or field settings, and with single-subject or group designs using between- or within-subjects variables. Summary scores derived from observational sessions can be subjected to any appropriate statistical approach from null-hypothesis testing to mathematical modeling (Rodgers, 2010).

References
Altmann, J. (1974). Observational study of behaviour: Sampling methods. Behaviour, 49, 227–267. doi:10.1163/156853974X00534
Bakeman, R. (2004). Sequential analysis. In M. Lewis-Beck, A. E. Bryman, & T. F. Liao (Eds.), The Sage encyclopedia of social science research methods (Vol. 3, pp. 1024–1026). Thousand Oaks, CA: Sage.
Bakeman, R. (2010). Reflections on measuring behavior: Time and the grid. In G. Walford, E. Tucker, & M. Viswanathan (Eds.), The Sage handbook of measurement (pp. 221–237). Thousand Oaks, CA: Sage.
Bakeman, R., Adamson, L. B., Konner, M., & Barr, R. (1990). Kung infancy: The social context of object exploration. Child Development, 61, 794–809. doi:10.2307/1130964
Bakeman, R., Deckner, D. F., & Quera, V. (2005). Analysis of behavioral streams. In D. M. Teti (Ed.), Handbook of research methods in developmental science (pp. 394–420). Oxford, England: Blackwell. doi:10.1002/9780470756676.ch20
Bakeman, R., & Gottman, J. M. (1986). Observing interaction: An introduction to sequential analysis. Cambridge, England: Cambridge University Press.
Bakeman, R., & Gottman, J. M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511527685
Bakeman, R., & Helmreich, R. (1975). Cohesiveness and performance: Covariation and causality in an undersea environment. Journal of Experimental Social Psychology, 11, 478–489. doi:10.1016/0022-1031(75)90050-5
Bakeman, R., & Quera, V. (1992). SDIS: A sequential data interchange standard. Behavior Research Methods, Instruments, and Computers, 24, 554–559. doi:10.3758/BF03203604
Bakeman, R., & Quera, V. (1995). Analyzing interaction: Sequential analysis with SDIS and GSEQ. Cambridge, England: Cambridge University Press.
Bakeman, R., & Quera, V. (2009). GSEQ 5 [Computer software and manual]. Retrieved from http://www.gsu.edu/~psyrab/gseq/gseq.html
Bakeman, R., & Quera, V. (2011). Sequential analysis and observational methods for the behavioral sciences. Cambridge, England: Cambridge University Press.
Bakeman, R., Quera, V., & Gnisci, A. (2009). Observer agreement for timed-event sequential data: A comparison of time-based and event-based algorithms. Behavior Research Methods, 41, 137–147. doi:10.3758/BRM.41.1.137
Bakeman, R., & Robinson, B. F. (1994). Understanding log-linear analysis with ILOG: An interactive approach. Hillsdale, NJ: Erlbaum.
Chorney, J. M., Garcia, A. M., Berlin, K. S., Bakeman, R., & Kain, Z. N. (2010). Time-window sequential analysis: An introduction for pediatric psychologists. Journal of Pediatric Psychology, 35, 1061–1070. doi:10.1093/jpepsy/jsq022
Cohen, J. A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. doi:10.1177/001316446002000104
Cohn, J. F., & Kanade, T. (2007). Use of automated facial image analysis for measurement of emotion expression. In J. A. Coan & J. J. B. Allen (Eds.), Oxford University Press Series in Affective Science: The handbook of emotion elicitation and assessment (pp. 222–238). New York, NY: Oxford University Press.
Douglass, W. (1760). A summary, historical and political, of the first planting, progressive improvements, and present state of the British settlements in North-America (Vol. 1). London, England: R. & J. Dodsley.
Ekman, P. W., & Friesen, W. (1978). Facial Action Coding System: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologist Press.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York, NY: Wiley.
Galisson, F. (2000, August). Introduction to computational sequence analysis. Tutorial presented at the Eighth International Conference on Intelligent Systems for Molecular Biology, San Diego, CA. Retrieved from http://www.iscb.org/ismb2000/tutorial_pdf/galisson4.pdf
Gottman, J. M. (1979). Marital interaction: Experimental investigations. New York, NY: Academic Press.
Gros-Louis, J., West, M. J., Goldstein, M. H., & King, A. P. (2006). Mothers provide differential feedback to infants’ prelinguistic sounds. International Journal of Behavioral Development, 30, 509–516. doi:10.1177/0165025406071914
Konner, M. J. (1976). Maternal care, infant behavior, and development among the !Kung. In R. B. DeVore (Ed.), Kalahari hunter-gathers (pp. 218–245). Cambridge, MA: Harvard University Press.
Martin, P., & Bateson, P. (2007). Measuring behaviour: An introductory guide (3rd ed.). Cambridge, England: Cambridge University Press.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453. doi:10.1016/0022-2836(70)90057-4
Oller, D. K. (2000). The emergence of the speech capacity. Mahwah, NJ: Erlbaum.
Parten, M. B. (1932). Social participation among preschool children. The Journal of Abnormal and Social Psychology, 27, 243–269. doi:10.1037/h0074524
Quera, V., Bakeman, R., & Gnisci, A. (2007). Observer agreement for event sequences: Methods and software for sequence alignment and reliability estimates. Behavior Research Methods, 39, 39–49. doi:10.3758/BF03192842
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12. doi:10.1037/a0018326
Sankoff, D., & Kruskal, J. (Eds.). (1999). Time warps, string edits, and macromolecules: The theory and practice of sequence comparison. Stanford, CA: CSLI. (Original work published 1983)
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680. doi:10.1126/science.103.2684.677
Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral data. Hillsdale, NJ: Erlbaum.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
Wolff, P. H. (1966). The causes, controls, and organization of behavior in the neonate. Psychological Issues, 5, 1–105.
Yoder, P., & Symons, F. (2010). Observational measurement of behavior. New York, NY: Springer.
Yoder, P. J., & Tapp, J. (2004). Empirical guidance for time-window sequential analysis of single cases. Journal of Behavioral Education, 13, 227–246. doi:10.1023/B:JOBE.0000044733.03220.a9
