
Perceptual Organization

Perfecto Herrera
Introductory sound
Perceptual Organization

“…Perceptual Organization is central to the key question of
perception: how do we make the leap from information
detected by our sensory receptors… to our perceptions of
the world? This requires not just the detection of information
but the organization of that information into veridical percepts.”

“…Perceptual organization is the process by which particular
relationships among potentially separate elements (including
parts, features, and dimensions) are perceived (selected from
alternative relationships) and guide the interpretation of those
elements… in sum, how we process sensory information in
Pomerantz & Kubovy, 1986
The problem(s) of perceptual organization

Some terms
 Source – the physical entity that gives rise to the sound
pressure waves (e.g. a violin being played)
 Stream – the percept of a group of successive and/or
simultaneous sounds as a coherent whole appearing to come
from a single source (e.g., the brass section)
 The sounds we hear at any one time usually come from a
number of different sources.
 In most cases we can hear and identify each of the different
sound sources as having its own pitch, timbre, loudness and
location (stream=source). In other cases several sources are
processed as a single stream as their features do not qualify
for being considered as “distinct” (e.g., string section). In other
–exotic- cases, a single source may yield different streams.
Auditory Scene Analysis
 A computational theory of hearing is required; plus a
functional explanation of the information processing
problems that the auditory system must solve in order
to make sense of the acoustic environment
 A computational theory of hearing deals with the
question: what is the purpose of hearing? Which are
the constraints and regularities hearing can exploit?
 Work in computer vision has benefited from a
computational theory since the late 1970’s, due to
David Marr
 A similar foundation for hearing was developed by
Albert Bregman at McGill University in Montreal and is
known as auditory scene analysis
Auditory Scene Analysis
ASA can be conceptualized as a two-stage process:
1. The mixture of sounds is decomposed into a
collection of sensory elements (onsets, pitch
trajectories, modulations, spectral tracks, etc.)
2. Elements that are likely to have arisen from the
same event are grouped to form a perceptual
structure (stream) which can be interpreted by
higher centers in the brain
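The two stages above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Element` class, the `decompose` and `group` functions, and the 30 ms onset tolerance are hypothetical and do not come from Bregman's work. The grouping rule is passed in as a function, so any cue could be substituted for the onset test used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stage-1 sensory element: one spectral track."""
    onset_s: float   # onset time in seconds
    freq_hz: float   # frequency of the track in Hz


def decompose(mixture):
    """Stage 1 stand-in: the 'mixture' is already given as (onset, freq) pairs."""
    return [Element(onset, freq) for onset, freq in mixture]


def group(elements, same_event):
    """Stage 2: add each element to the first stream whose first member
    passes the same_event test; otherwise start a new stream."""
    streams = []
    for el in elements:
        for stream in streams:
            if same_event(stream[0], el):
                stream.append(el)
                break
        else:
            streams.append([el])
    return streams


def common_onset(a, b, tol_s=0.03):
    """Assumed grouping rule: onsets within tol_s seconds of each other."""
    return abs(a.onset_s - b.onset_s) <= tol_s


mixture = [(0.00, 220.0), (0.00, 440.0), (0.00, 660.0),
           (0.25, 300.0), (0.25, 600.0)]
streams = group(decompose(mixture), common_onset)
for i, stream in enumerate(streams):
    print(i, [el.freq_hz for el in stream])
```

With this toy input the three components starting at 0 s form one stream and the two starting at 0.25 s form another.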

For example, when listening to a violin performance, it is the
task of auditory scene analysis to group the acoustic events
emitted from the physical source (the violin) into a perceptual
stream (the mental experience of a violin being played).
Is this the only way of listening? What about “reduced listening”?
Read Pierre Schaeffer
Auditory Scene Analysis

In most listening situations, a mixture of sounds reaches the
ears. However we can:
 Attend to one conversation amid many competing voices
and other background sounds (e.g. music) at a ‘cocktail party’
 Follow the melodic line played by the violins in an orchestral piece

This problem is of great scientific interest, and a solution also
has engineering applications -> The Holy Grail!!!

[Figure: “auditory image” of Bach’s Mass in Bm, consisting of
voice, violin, cello etc. How does the auditory system process this
image to recover a description of each source?]
Active Perception: expectation-based
processing (bottom-up + top-down)
Auditory Scene Analysis

 The inner ear separates sound into its
frequency components
 At some point in the auditory system these
components need to be assigned to the
appropriate sound source
 Often called “perceptual grouping”, or
“auditory scene analysis”
 Two aspects: simultaneous grouping and
sequential grouping
Auditory Scene Analysis
 Simultaneous grouping – the grouping
together of the simultaneous
frequency components that come
from a single source
 Sequential grouping – the connecting
over time of the changing frequencies
that a single source produces from
one moment to the next
Antecedents: Gestalt Psychology
 Gestalt means “pattern“ in German
 Gestalt Psychology originated in the early 20th century with Max
Wertheimer (1880-1943), Wolfgang Köhler (1887-1967) and
Kurt Koffka (1886-1941)
 The basic principles underlying Gestalt psychology are
 The whole is greater than the sum of the parts
 The parts are defined by the whole as much as vice versa
 Gestalt psychologists are best known for their work in vision –
but their principles are also applicable to auditory perception.
 They systematically developed a set of principles of perceptual
organisation (believed to be innate) that they thought
determine how we assemble or associate components in a
perceptual field
 These principles are…
Gestalt Psychology Principles
 Proximity
 Similarity
 Common Fate (Common Direction)
 Good Continuation
 Disjoint Allocation (Belongingness)
 Closure

Bottom-up (primitive): hard-wired, pre-attentive, not learned
Top-down (schema-driven): plastic
Proximity
 In vision, when elements in an image are close together they are
perceived to belong together and to be separate from others that
are further away, even though they are similar
 In hearing, sounds occurring close together in time are clustered
Similarity
 Two or more auditory events are grouped if they are similar in
timbre, pitch, loudness or close in apparent location or time
 Fundamentals in the same region but harmonics not: leads to
fission, i.e. different timbres but same pitch = unfused
 Harmonics in the same region but fundamentals not: leads to fusion,
i.e. different pitches but same timbre = fused
 This is not clear-cut – it depends on individual differences
 If the difference in loudness is large enough, the sounds form
different streams – either can be attended to
 Same dB -> single stream at twice the tempo
Common Fate
 Components in sound act together
 They tend to start and finish together
 They tend to change in pitch or intensity together
 Therefore, if the components of a complex sound change in a
co-ordinated way they are fused; the relevant cues are onset
disparities, and AM and FM (tremolo & vibrato)
 For example, if the frequencies of harmonics 2, 4 and 8 are
modulated (FM) they separate from harmonics 3, 5, 6 and 7
 Or, if the frequency of the 1st harmonic is modulated (FM)
at a different rate, it separates from harmonics 3, 4 and 5
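As a rough illustration of common fate, the following Python sketch builds idealised frequency tracks for eight harmonics, applies the same vibrato to harmonics 2, 4 and 8, and groups the harmonics whose relative frequency trajectories match. The function names, the 200 Hz fundamental, the 4 Hz rate, the 2% depth and the matching tolerance are all assumptions chosen for the example; this is not a model of how the auditory system detects correlated change.

```python
import math


def make_tracks(f0_hz=200.0, n_harmonics=8, modulated=(2, 4, 8),
                rate_hz=4.0, depth=0.02, n_frames=100, dur_s=1.0):
    """Return {harmonic number: list of frequencies over time}.
    Harmonics listed in `modulated` share one sinusoidal FM (vibrato)."""
    tracks = {}
    for n in range(1, n_harmonics + 1):
        track = []
        for k in range(n_frames):
            t = k * dur_s / n_frames
            mod = depth * math.sin(2 * math.pi * rate_hz * t) if n in modulated else 0.0
            track.append(n * f0_hz * (1.0 + mod))
        tracks[n] = track
    return tracks


def relative_trajectory(track):
    """Frequency change relative to the track's own mean frequency."""
    mean = sum(track) / len(track)
    return [f / mean - 1.0 for f in track]


def group_by_common_fate(tracks, eps=1e-6):
    """Group harmonics whose relative trajectories agree within eps."""
    groups = []  # list of (reference trajectory, member harmonic numbers)
    for n, track in tracks.items():
        traj = relative_trajectory(track)
        for ref, members in groups:
            if max(abs(a - b) for a, b in zip(traj, ref)) < eps:
                members.append(n)
                break
        else:
            groups.append((traj, [n]))
    return [members for _, members in groups]


groups = group_by_common_fate(make_tracks())
print(groups)
```

The steady harmonics end up in one group and the jointly modulated ones in another, which mirrors the perceptual separation described above.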
Good Continuation
 Natural sound sources tend to change gradually rather
than abruptly in frequency, intensity, location or timbre
 Abrupt change -> new stream -> new source
 Low and high tones tend to split into streams – this can
be suppressed by putting glides in between
 In speech, if there are oscillations in frequency it gives the
impression that there are two speakers saying the one
 In music in general if a note is near in pitch to the one
just before it then it will be heard as the next note in
the melody rather than a note that is separate - higher
or lower
Disjoint Allocation (Belongingness)
 One component can only come from one source –
i.e. hearing tries to use each component only once
 Say we have two tones (A and B) at slightly different pitches,
which can be heard either in isolation or embedded in another
series of pitches. In isolation the order AB or BA is easily judged.
 The addition of pitches (X’s) that are close in pitch to
AB acts as a distracter, making it difficult to order AB
(this is thought to be because we attend more to the
start and end of sequences).
 But if more X’s are added, they form a stream that is
separate from AB and again the order of AB is easily judged.
 This is not hard & fast – ambiguity is possible, and this
shows that this level of organisation is on the boundary of
being pre-attentive and attentive.
 It also shows how the addition of new elements
changes the perceptual organisation of the stimulus.
[Figure: is it more difficult to tell if A sounded before B?
(assume fast tempo)]
Closure
 A source may be obscured or absent – but its percept
 e.g. FM radio – disturbance from ignition of passing
cars – we hear a click over the sound whereas in fact
the radio is producing only a click and the sound is off
 A pitched sound that is broken but the gap is filled by
noise seems unbroken
 Similarly a glide that is broken but the gap is filled with
noise seems unbroken
Auditory Scene Analysis
Bregman re-examines the Gestalt principles
and proposes the simultaneous and
sequential grouping cues as the basic
elements of information that help to organize
our perception: what, when, where, how

Bregman, A. S. (1990) Auditory scene analysis:
the perceptual organisation of sound.
Cambridge, Mass.: The MIT Press
But see also:
Wang, D. & Brown, G. (Editors) (2006).
Computational Auditory Scene Analysis:
Principles, Algorithms and Applications. New
York: Wiley.
Example of CASA-based auditory segmentation
[Figure: time-frequency plot; axes: Time (s), 0 to 2.5, and Frequency (Hz).
Source: Wang (2005), An Auditory Scene Analysis Approach to Speech Segregation]

Simultaneous Grouping
Sequential Grouping
Simultaneous grouping

Some cues:
 Fundamental Frequency and Spectral Regularity

 Onset Timing

 Correlated changes in Amplitude or Frequency

 Sound Location

 Important: A single cue may not be effective all
the time – these cues work together for
perceptual organisation of the input sound
Fundamental Frequency
 Consider two musical instruments each playing a note
 It is easier to hear each note and each instrument if they are
playing different notes (have different fundamental frequencies)
 Simultaneous sounds are more likely to fuse if they have the
same fundamental frequency
 It has been shown that a pair of simultaneously presented
vowels are easier to identify if their fundamental frequencies differ
Spectral Regularity

 Perceptual fusion of the frequency
components from a harmonic sound –
harmonicity – heard as a single sound
 If a frequency component does not form
part of the harmonic series it tends to be
heard out separately – as if part of a
different source
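The harmonicity cue can be sketched in a few lines of Python: given a candidate fundamental, components close to an integer multiple of it are treated as fused, and the rest are flagged as likely to be heard out. The function name `split_by_harmonicity` and the 3% relative tolerance are assumptions for illustration only; the slides do not give a mistuning threshold.

```python
def split_by_harmonicity(freqs_hz, f0_hz, tol=0.03):
    """Split components into (harmonic, inharmonic) relative to f0_hz.
    A component counts as harmonic if it lies within a relative
    tolerance `tol` of the nearest integer multiple of f0_hz."""
    harmonic, inharmonic = [], []
    for f in freqs_hz:
        n = max(1, round(f / f0_hz))            # nearest harmonic number
        rel_err = abs(f - n * f0_hz) / (n * f0_hz)
        (harmonic if rel_err <= tol else inharmonic).append(f)
    return harmonic, inharmonic


# A 200 Hz complex whose 4th harmonic is mistuned from 800 Hz to 848 Hz.
components = [200.0, 400.0, 600.0, 848.0, 1000.0]
fused, heard_out = split_by_harmonicity(components, f0_hz=200.0)
print(fused, heard_out)
```

In this toy case only the 848 Hz component falls outside the tolerance, so it is the one predicted to stand apart from the complex.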
Onset disparities

 Perceptual separation of tones is enhanced by onset asynchrony.
 A frequency component that stops or starts at a different time
from the complex sound is less likely to be heard as part of it
than if it is simultaneous with it
 Important for making a “soloist” stand out
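A simple way to picture this cue is to compare each component's start and stop time with the typical start and stop of the complex. The Python sketch below does that with medians. The function name, the component labels and the 30 ms tolerance are hypothetical; the tolerance is an arbitrary illustrative value, not a perceptual threshold.

```python
from statistics import median


def asynchronous_components(components, tol_s=0.03):
    """components maps a label to (start_s, stop_s).
    Return the labels whose start or stop deviates from the median
    start or stop of the whole complex by more than tol_s seconds."""
    typical_start = median(start for start, _ in components.values())
    typical_stop = median(stop for _, stop in components.values())
    return [
        label
        for label, (start, stop) in components.items()
        if abs(start - typical_start) > tol_s or abs(stop - typical_stop) > tol_s
    ]


complex_tone = {
    "h1": (0.00, 1.00),
    "h2": (0.00, 1.00),
    "h3": (0.08, 1.00),   # starts 80 ms late
    "h4": (0.00, 1.00),
}
late = asynchronous_components(complex_tone)
print(late)
```

Here only the late-starting component is flagged as less likely to be heard as part of the complex.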
Onset disparities

 We can hear each of two ‘simultaneously’ played notes more
easily if there is a small onset difference between them
 These onset asynchronies are up to 30 ms –
so the percept is still of the notes sounding together
 The auditory system can exploit these onset
differences even though we are not
consciously aware of them
 Ensemble playing – completely
Correlated Changes in
Amplitude or Frequency
 A sound may be perceptually segregated
from an unchanging background if its
components are modulated in amplitude or
frequency
 Hear harmonic complex tone

 Harmonics 1, 3, 5, 6, 7 remain steady
 Harmonics 2, 4, and 8 rise and fall in
frequency four times
 Hear the two sets as separate sounds
Sound Location

 Sounds coming from different
locations in space are generally
assumed to be from different sources
 But… this is a weak cue for
simultaneous grouping; it becomes
stronger for sequential grouping
Sequential organisation
 Events in the world occur over time. We organise sounds into
sequences over time using various criteria
 Events that are similar in some way (e.g. in loudness or pitch) or
going in the same direction (e.g. rising or falling) are perceived to
have the same origin.
 Music uses this principle
 Streams are created by differences in pitch, loudness, timbre,
repetition rate etc and by combining these in different ways.
 Characteristics of Streams:
 Streams are separate – we only attend to one fully at a time.
 Foreground and Background – possibly 3 maximum
 Stream organisation is relative rather than absolute
 Stream organisation may change as the complexity of the
stimulus changes
 Some aspects of streaming are pre-attentive, others are
attentive, i.e. by attending to different aspects of a
stimulus we hear different things
Sequential grouping

 Periodicity cues: periodic oscillations help to
group objects according to their rates
 Spectral cues: we tend to group in time
elements that appear in the same spectral
regions (e.g., high partials vs. low partials)
 Level (intensity) cues: we tend to group in
time elements of similar level
 Spatial cues: we tend to group in time
elements coming from the same place
Features Important for
Sequential Grouping
 Spectral distribution (old+new heuristic)
 What happens when pitch separation and/or repetition rate
are varied?
 If we compress the time dimension do we hear notes that are
further apart in frequency belonging together?
 This was tested by Van Noorden (1976,1977), who found:
 Segregation depends on repetition rate and pitch separation
 When stream segregation occurs, we are unable to attend
fully to the events in both streams at the same time
 We find it difficult to distinguish the order of events across streams
 We have trouble hearing the overall rhythm of the sequence

 Frequency and temporal contiguity – auditory streaming
[Figure: auditory streaming vs. frequency separation]
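The dependence of streaming on pitch separation can be sketched with a greedy rule: each incoming tone joins the stream whose last tone is nearest in pitch, unless the jump exceeds a limit, in which case it starts a new stream. This is only an illustration under stated assumptions. The function names and the fixed 5-semitone limit are hypothetical, and the sketch ignores repetition rate, which the findings above say also matters.

```python
import math


def semitones(f1_hz, f2_hz):
    """Absolute pitch distance between two frequencies in semitones."""
    return abs(12.0 * math.log2(f2_hz / f1_hz))


def assign_streams(tone_freqs_hz, max_jump_st=5.0):
    """Greedy sequential grouping by pitch proximity.
    max_jump_st is an assumed limit for a fast sequence."""
    streams = []
    for f in tone_freqs_hz:
        best = None  # (distance, stream) of the closest acceptable stream
        for stream in streams:
            d = semitones(stream[-1], f)
            if d <= max_jump_st and (best is None or d < best[0]):
                best = (d, stream)
        if best is None:
            streams.append([f])
        else:
            best[1].append(f)
    return streams


close = [440.0, 494.0] * 3   # alternating tones about 2 semitones apart
far = [440.0, 880.0] * 3     # alternating tones 12 semitones apart
print(len(assign_streams(close)), "stream(s) for the close sequence")
print(len(assign_streams(far)), "stream(s) for the far sequence")
```

The close sequence stays in a single stream, while the far sequence splits into a low stream and a high stream, as in the classic alternating-tone demonstrations.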
The Figure-Ground
Phenomenon and Attention
 Generally we do not attend to every aspect of the
auditory input – certain parts are selected for
conscious analysis
 Complex sound is analysed into streams – we attend
to one stream at a time – attended stream stands out
perceptually – rest of sound becomes less salient
 Separation into attended and unattended streams is
equivalent to the ‘figure-ground phenomenon’
 Examples: Attending to one conversation at a time at a
party – other conversations form a background; music
with soloists; TV + noisy home…
 Importance of changes – the listeners’ attention is
usually drawn to aspects of the sound that are
changing – it becomes figure while the relatively
unchanging part(s) become background
Guess who wrote this text:
“It is not enough to be able to describe the
response of single cells, nor predict the
results of psychophysical experiments. Nor
is it enough even to write computer
programs that perform approximately in the
desired way: One has to do all these things
at once, and also be very aware of the
computational theory...”
This presentation reused materials from
educational and research slides and
documents by
 Dan Ellis
 Guy Brown
 Niall Griffith
 Rianna Walsh
 Chris Darwin
 Sue Denham
