Audiovisual Perception
by Gerhard Daurer
Abstract
Our senses enable us to take in very diverse information about our surroundings, and they differ from one another in a
number of features specific to their individual modes. The eye specializes in the perception of spatial structure, and the ear in
the perception of temporal processes. Only in the rarest cases, however, are we confronted with sensory stimuli of only a single
modality; we perceive our world through all five senses and hence multimodally. Our sense organs are therefore not, as is
often assumed, isolated from one another; it is their synergetic interplay that gives human beings their evolutionary
advantage. Irrespective of the modality, the most reliable stimulus in a given situation dominates all the others; if one sense
provides too little or unclear information, the other senses enter as a corrective. The integration of multimodal sensory
stimuli into meaningful units is called multimodal integration; to a certain extent, it already occurs on the neuronal level and
hence unconsciously and passively. Another common way to link stimuli across the boundaries between the senses is
known as intermodal analogy. This involves searching consciously and actively for an amodal quality that is present in several
sensory domains, such as intensity or brightness, in order to form analogies that transcend the boundaries between the senses.
These mechanisms, which are described very briefly here, are, of course, just two elements in the interaction of hearing and
seeing. Any consideration of audiovisual perception should at least distinguish between genuine synesthesia and other
associative or emotional links between image and sound.
www.see-this-sound.at/print/33.html 1/7
27/2/2021 See this Sound (http://see-this-sound.at)
By perceiving high-frequency light waves with wavelengths from 380 to 680 nm, which travel in nearly straight lines, the eye can
resolve surface structures in extremely fine detail. This gives us the capacity for precise spatial orientation, as
is evident, for example, when we read tiny structures such as writing. The human auditory system, by contrast, is optimized to
understand the human voice. The sound waves in this range, with wavelengths from about 20 to 30 cm, bend around smaller
objects and hence are comparatively poorly suited for the precise perception of space.
Seeing is a targeted and directed process that takes place actively and consciously. As a rule, visual stimuli are transferred
directly to the cerebrum for rational processing, which makes visual perception ideally suited for dealing with highly
differentiated inputs. In hearing, by contrast, several nerve tracts lead directly from the ear to the diencephalon, or interbrain, which is responsible,
among other things, for controlling the emotions. That is why acoustic stimuli are able to trigger relatively direct feelings and
physical reactions (such as an increased pulse rate). Moreover, the ear is immobile and cannot be closed, which is why, like it
or not, we register all acoustical events in our surroundings, even when we are sleeping. Hearing often occurs unconsciously
and passively and can be described as totalizing and collectivizing, since acoustic perception will largely coincide for several
people in a room.
Our visual perception is optimized for grasping static objects and can, as a rule, follow no more than one movement, such as a
passing car. Acoustic phenomena, by contrast, are practically inconceivable without dynamic changes. When we listen, we
have no problem distinguishing between several simultaneous movements, such as the noises of two different cars driving
away. Moreover, the sense of hearing is fundamentally faster than the sense of vision in taking in and processing sensory
stimuli. The ear thus tends to specialize in the perception of temporal processes and the eye in the detailed resolution of static
phenomena, which is probably the basis of the common association of images with permanence and sounds with ephemerality.[2]
Despite the creation of a means of segregating information on a sense-by-sense basis, evolution did not
eliminate the ability to benefit from the advantages of pooling information across sensory modalities.
Rather, it created an interesting duality: some parts of the brain became specialized for dealing with
information within individual senses, and others for pooling information across senses. [3]
Only in the rarest cases are we confronted with sensory stimuli of a single modality, since we perceive our environment with
five senses and thus in a multimodal way. In fact, separating our perception into several independent worlds — hearing,
seeing, smelling, tasting, feeling — is an enormous task of abstraction that human beings only learn as part of their cultural
socialization. In the literature, one sometimes comes across the informative observation that the media and technical
apparatuses of the nineteenth century contributed to the singularizing of our senses:

For sound, movement, body, and image had an immediate relationship in the history of culture at least until
technical possibilities enforced their separation … People stared as if paralyzed at the horns of phonographs.
From that time onward, specializing of certain bodily functions outside the body, and hence a specializing of
the senses, was necessitated.[4]

Whether or not one chooses to accept this specific assessment, it cannot be denied that technical
development and human perception are closely intertwined. Ever since developments in twentieth-century media made the
synchronized recording and playback of image and sound possible, our perception and the use and weighting of our senses
have been transformed again: the separation of hearing and seeing has been undermined by technology. Whereas the primacy of the
eye that dominates in Western culture has come to seem increasingly problematic, the interaction of the senses has been
recognized as more important and been discussed more and more frequently. This is another reason why it makes sense to
speak of audiovisual perception. Several mechanisms involved in the complex interplay of hearing and seeing will be presented
below.
3 Multimodal Integration
The possibility of integrating multimodal sensory stimuli into meaningful units has many advantages for us, for example an
improved understanding of language. The enormous selectivity of our sense of hearing makes it possible to follow a voice
attentively even in a loud environment. If at the same time one watches the facial and lip movements of the person speaking,
understanding improves enormously. Multimodal integration thus means that the perception in the realm of one sense is
influenced by perception in another, since the two components are integrated into an interpretation which is as consistent as
possible. The linkage of visual and auditory stimuli does not result solely — as was long assumed — from mental construction;
it has been shown that various sensory stimuli already converge on the neuronal level in so-called multimodal neurons.[5]
Multimodal integration thus occurs on the lowest level of perception, sometimes even before the object is recognized.
In contrast to genuine synesthesia, which is considered absolute,[6] multimodal integration is dependent on the context: if
one sense provides too little or unclear information, other senses enter as a corrective: in dark surroundings, for example,
auditory perception becomes more important for orientation in space. Irrespective of modality, in a given situation the most
reliable stimulus will dominate all the others. Factors in evaluating reliability naturally include attentiveness, experience,
motivation, and previous knowledge, which makes it clear that multimodal integration cannot be reduced to processes taking
place in multimodal neurons. In any case, the key to stable perception is the efficient combination and integration of
information of different modalities.
5 Perception-Intensifying Effects
Now the question arises as to which mechanisms of multimodal integration can lead to improved perception.[8] Stein et al.
report an experiment in which the sensitivity of individual multimodal neurons was tested.[9] The reaction to the bimodal
stimulus (flash of light plus beep) corresponded approximately to the sum (additive) of reactions to unimodal stimuli. By
analogy to the rule of space and the rule of time, a close connection between spatial and temporal coincidence was confirmed
on a neuronal basis: the reaction to spatially disparate stimuli was lower than the sum of the two separate stimuli
(subadditive), which corresponds to a weakening of the sensation. Stein and Meredith were also able to show particularly
strong (superadditive) effects with relatively weak but spatially and temporally coincident bimodal stimuli.[10] Thus, reactions triggered by bimodal
stimuli are already distinct on the neuronal level from mode-specific components of reaction, which explains in part the
perception-intensifying effect of combinations of images and sounds. The phenomena described are probably the neurological
basis for Michel Chion’s programmatic observation: We never see the same thing when we also hear; we don’t hear the same thing when we see as well.[11]
6 Intermodal Analogy
In addition to modal qualities that occur exclusively for just one sense (for example, pitch for the sense of hearing, color for the
sense of sight), there are also amodal (or intersensory)[12] qualities that are perceived by multiple senses. The psychologist
Heinz Werner examined such phenomena in detail as early as the 1960s. When we say that a tone is strong or weak,
that a pressure is strong or weak, we are no doubt referring in all cases to a property that is the same in all
these sensory domains. Recent research has now shown that there are doubtless many more properties than
psychology previously assumed that, like intensity, can be called intersensory.[13] Werner listed the following
properties with which it is possible to establish analogies across the boundaries between the senses: intensity, brightness,
volume, density, and roughness. According to Michel Chion, these and other amodal qualities are in fact at the center of our
perception.[14]
Using these dimensions, it is possible to relate sensory impressions of widely different modalities to one another — that is, to
create intermodal analogies. The process used, in contrast to multimodal integration, occurs consciously and actively, since
one is searching for a criterion of comparison that is usually found in one of the amodal dimensions. For example, the question
‘What sound goes with this color?’ can be decided on the basis of brightness. The formation of such analogies is influenced by
the context, since the color or loudness of an object is not an absolute value but can only be assessed in relation to its
surroundings. Intermodal analogies tend to be regular from person to person (small interpersonal
variation), whereas synesthetic correspondences differ greatly from person to person (large interpersonal variation). In general, the dimensions
of brightness and intensity seem to be of central importance to the formation of intermodal analogies.[15]
The psychologist Albert Wellek examined similar connections in the 1920s. Experimental testing enabled him to compile a
list of six correspondences (so-called primeval synesthesias), which in his view could be found among all peoples at all times
and hence were fixed in human perception: thin forms correspond to high pitches, thick forms to low pitches, and so on. The
historically demonstrable universality of these simplest sensory parallels goes so far that everyone, even today,
will consider all six correspondences valid and intelligible, at least in one of the given forms.[16] Our Western
notation clearly corresponds to these primeval synesthesias, since, for example, the pitches are depicted, by visual analogy, as
high or low, which makes intuitive sense to us.
This model represents a first step toward raising representations of human perception to a more complex level.
Footnotes
[1] See Michael Giesecke, Die Entdeckung der kommunikativen Welt (Frankfurt am Main: Suhrkamp, 2007), 240ff.
[2] So, overall, in a first contact with an audiovisual message, the eye is more spatially adept, and the ear more temporally
adept. Michel Chion, Audio-Vision: Sound on Screen, trans. Claudia Gorbman (New York: Columbia University Press, 1994),
11.
[3] Barry Stein et al., “Crossmodal Spatial Interactions in Subcortical and Cortical Circuits,” in Crossmodal Space and
Crossmodal Attention, eds. Jon Driver and Charles Spence (Oxford: Oxford University Press, 2004), 25–50.
[4] Susanne Binas, “Audio-Visionen als notwendige Konsequenz des präsentablen ‘common digit’: Einige Gedanken zwischen
musikästhetischer Reflexion und Veranstaltungsalltag,” in Techno-Visionen: Neue Sounds, neue Bildräume, eds. Sandro
Droschl, Christian Höller, and Harald Wiltsche (Vienna: Folio, 2005), 112–20. — Trans. S. L.
[5] Barry Stein and Alex Meredith, The Merging of the Senses (Cambridge, MA: MIT Press, 1993).
[6] See Klaus-Ernst Behne, “Wirkungen von Musik,” in Kompendium der Musikpädagogik, eds. S. Helms, R. Schneider, and
R. Weber (Kassel: Bosse, 1995), 281–332.
[7] The rule of time is used a great deal in sound design, for example, when images associated with notes that are in fact wrong
are synthesized in our perception into harmonious impressions.
[8] [Multisensory stimuli] add depth and complexity to our sensory experiences and, as will be shown below, speed and
enhance the accuracy of our judgements of environmental events in a manner that could not have been achieved using only
independent channels of sensory information. Stein et al., “Crossmodal Spatial Interactions,” 25.
[12] In my view, the terms amodal and intersensorial are used synonymously in the literature.
[13] Heinz Werner, “Intermodale Qualitäten (Synästhesien),” in Handbuch der Psychologie, ed. Wolfgang Metzger,
(Göttingen: Hogrefe, 1965), 278–303. — Trans. S. L.
[15] The association of pitch with color via the dimension of brightness is probably the only meaningful analogy that can be
made between these two domains. A familiar procedure used by subjects seeking to correlate different sensory domains via the
dimension of intensity is ‘cross-modality matching.’ — Trans. N. W.
[16] Albert Wellek, “Die Farbe-Ton-Forschung und ihr erster Kongreß,” Zeitschrift für Musikwissenschaft 9 (1927): 576–584. — Trans. S. L.
[17] The enormous importance of sociocultural factors in perceptual processes is pointed out in the following citation from
Thomas Kuhn. What scientists perceive and how they perceive is always dependent on the paradigm within which they
operate: What is built into the neural process that transforms stimuli to sensations has the following characteristics: it has
been transmitted through education; it has, by trial, been found more effective than its historical competitors in a group’s
current environment; and, finally, it is subject to change both through further education and through the discovery of misfits
with the environment. Thomas Kuhn, The Structure of Scientific Revolutions (Chicago: University of Chicago Press, 1970),
196.
[18] Michael Haverkamp, Synästhetisches Design: Kreative Produktentwicklung für alle Sinne (Munich: Hanser, 2009).