
Research Article

Detection of Audiovisual Speech Correspondences Without Visual Awareness

Psychological Science, 24(4), 423–431
© The Author(s) 2013
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0956797612457378
http://pss.sagepub.com

Agnès Alsius and Kevin G. Munhall
Queen's University

Corresponding Author: Agnès Alsius, Department of Psychology, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, Canada K7L 3N6. E-mail: aalsius@gmail.com

Abstract
Mounting physiological and behavioral evidence has shown that the detectability of a visual stimulus can be enhanced by a
simultaneously presented sound. The mechanisms underlying these cross-sensory effects, however, remain largely unknown.
Using continuous flash suppression (CFS), we rendered a complex, dynamic visual stimulus (i.e., a talking face) consciously
invisible to participants. We presented the visual stimulus together with a suprathreshold auditory stimulus (i.e., a voice
speaking a sentence) that either matched or mismatched the lip movements of the talking face. We compared how long
it took for the talking face to overcome interocular suppression and become visible to participants in the matched and
mismatched conditions. Our results showed that the detection of the face was facilitated by the presentation of a matching
auditory sentence, in comparison with the presentation of a mismatching sentence. This finding indicates that the registration of
audiovisual correspondences occurs at an early stage of processing, even when the visual information is blocked from conscious
awareness.

Keywords
speech perception, consciousness
Received 1/22/12; Revision accepted 6/14/12

The ability to combine information presented in different sensory modalities into unified percepts is an important aspect of an organism's successful interaction with the external world. Psychophysical studies have demonstrated that multisensory integration enhances perceptual sensitivity, reduces ambiguity, and shortens reaction time (RT; Bolognini, Frassinetti, Serino, & Ladavas, 2005; Frassinetti, Bolognini, & Ladavas, 2002). For instance, a synchronous sound can improve detection of a flash (McDonald & Ward, 2000; Noesselt, Bergmann, Hake, Heinze, & Fendrich, 2008), increase the perceived luminance of a foveal light (Stein, London, Wilkinson, & Price, 1996), and influence the perceptual organization of simple naturalistic visual scenes (Sekuler, Sekuler, & Lau, 1997; Shams, Iwaki, Chawala, & Bhattacharya, 2005; Vroomen & de Gelder, 2000) and complex naturalistic visual scenes (Arrighi, Marini, & Burr, 2009; Kim, Kroos, & Davis, 2010).

Despite the many studies illustrating the consequences of multisensory integration at the behavioral and neural levels, the actual mechanisms of such cross-modal enhancement and the stages of processing at which it occurs remain largely unknown. In our experiments reported here, we explored whether conscious awareness of visual sensory signals is required for audiovisual integration to occur, or whether the visual and auditory signals interact to some extent at a preconscious level (i.e., prior to conscious visual sensory processing).

Some recent studies have indicated that when an audiovisual speech stimulus is presented, the observer's awareness of the visual input (Munhall, ten Hove, Brammer, & Paré, 2009) or acoustic input (Tuomainen, Andersen, Tiippana, & Sams, 2005) is a key factor influencing audiovisual integration. These studies have involved the McGurk illusion (McGurk & MacDonald, 1976), in which the perception of an otherwise intelligible speech sound is dramatically altered if the observer is watching a phonologically conflicting visual (lip movement) utterance. To study whether audiovisual integration arises only when the visual stimulus is consciously processed as a face, Munhall et al. (2009) used a bistable percept, a dynamic version of Rubin's (1915) classic face-vase illusion in which the face moved and seemingly articulated a syllable. The results showed that a simultaneous sound was more likely to be integrated with the visual speech stimulus when the face, rather than the vase, was consciously perceived as the dominant figure. Tuomainen et al. (2005) combined pseudowords synthesized as sine-wave speech with visual information from speaking lips. In sine-wave speech, the formants of the original speech are replaced with time-varying sinusoids; only observers informed of its speechlike nature perceive a sine-wave stimulus as speech.
Tuomainen et al. found that incongruence between the visual and auditory information influenced categorization of the sine-wave speech (the McGurk effect) only when the sounds were heard as actual speech, whereas the visual influence was negligible when participants were not aware of the speechlike nature of the stimuli.

In contrast to the results of these studies, mounting neural evidence indicates that audiovisual interactions can take place at very early stages of processing, in primary sensory cortices (Shams et al., 2005; Watkins, Shams, Tanaka, Haynes, & Rees, 2006) or even subcortically (see Driver & Noesselt, 2008, for a review). Given that such cross-modal effects arise early (Klucharev, Möttönen, & Sams, 2003), it seems possible that at least some intersensory interactions occur at a preconscious level of processing. The detectability of simple visual signals can be enhanced by the presence of simultaneous task-irrelevant sounds (e.g., Bolognini et al., 2005). Auditory information also heightens perceptual awareness of visual stimuli (Frassinetti, Bolognini, Bottari, Bonora, & Ladavas, 2005) by, for instance, preventing adaptation to the visual information (Sheth & Shimojo, 2004) or extending the duration of a congruent visual scene's dominance in binocular rivalry (Alais, van Boxtel, Parker, & van Ee, 2010; Chen, Yeh, & Spence, 2011; Conrad, Bartels, Kleiner, & Noppeney, 2010; van Ee, van Boxtel, Parker, & Alais, 2009; see Schwartz, Grimault, Hupé, Moore, & Pressnitzer, 2012, for a recent review on multistable perception and cross-modal binding). Finally, information perceived without awareness in one modality can sometimes bias the perceptual experience of sensory information presented in another modality, leading to some classic multisensory illusions (Bertelson, Francesco, Ladavas, Vroomen, & de Gelder, 2000; Dufour, Touzalin, Moessinger, Brochard, & Després, 2008; Leo, Bolognini, Passamonti, Stein, & Ladavas, 2008). Together, these findings suggest that intersensory interactions, at least for simple audiovisual pairings, can occur at early levels of processing, even when the visual or auditory modality is blocked from awareness.

Simple multisensory stimuli commonly used in multisensory experiments, such as beeps and light flashes, differ from naturally occurring events in a number of ways; for instance, they are brief and have minimal internal structure. The observer combines such stimuli on the basis of their similar onset timing, close proximity, or both, creating an artificial event. More-complex natural stimuli have inherent dynamics in each modality that reflect the source of the stimuli, and these dynamics must be matched for multisensory integration to occur. It is notable that for natural stimuli, such as spoken language, variables such as onset timing and location are of limited importance for integration (Munhall & Vatikiotis-Bateson, 2004).

The question of whether conscious awareness of sensory signals is necessary for the integration of complex natural signals remains unresolved. Within the speech domain, a few studies have shown that audiovisual conditions lower the detection threshold for masked auditory stimuli. For instance, Grant and Seitz (2000) showed that the ability to detect the presence of auditory speech in a background of noise was improved when observers also saw a face articulating the same utterance. Kim and Davis (2004) partially replicated this finding with a larger stimulus sample. However, both of these studies had limitations. Grant and Seitz used only three sentences repeatedly in their entire experiment, and Kim and Davis's study contained no audiovisual control condition in which the auditory and visual information were incongruent.

The aim of the present study was to investigate whether the detectability of a complex visual signal, preconsciously processed, can be improved by a concurrent acoustic stream. The use of a visual detection task rather than an auditory detection task is advantageous because the visual signal in an audiovisual event characteristically precedes the auditory signal in time and is thought to have a priming effect on auditory processing (Sánchez-García, Alsius, Enns, & Soto-Faraco, 2011). Thus, local expectancy effects were minimized in our design. An interocular suppression technique called continuous flash suppression (CFS; Tsuchiya & Koch, 2005) was used to keep the visual stimulus (a talking face) outside of participants' conscious awareness. In CFS, arrays of random overlapping rectangles of varying sizes and colors are presented successively to one eye at a fast pace, reliably suppressing from consciousness an image presented to the corresponding location of the other eye.

Several studies using the CFS paradigm have demonstrated that both low-level properties of visual stimuli, such as coherence in motion (Conrad et al., 2010; Yamada & Kawabe, 2011), and high-level aspects of visual stimuli can be processed during interocular suppression (Jiang, Costello, & He, 2007). For instance, the right fusiform face area is activated even when faces are rendered consciously invisible through CFS (Jiang & He, 2006).

In the current experiments, the articulatory movements of the suppressed talking face were either matched or mismatched with an auditory sentence. In Experiment 1, we compared how long it took for the talking face to overcome interocular suppression and become visible to participants in the matching and mismatching conditions (see Mudrik, Breska, Lamy, & Deouell, 2011, and Yang, Zald, & Blake, 2007, for similar paradigms). Differences between RTs in the two conditions would suggest that audiovisual correspondences can be detected before visual information is consciously perceived. In Experiment 2, to avoid anticipatory responses and ensure that participants were following the instructions, we followed the same procedure but included some catch trials in which the face was not presented. Finally, in a control experiment (Experiment 3), we excluded the possibility that any difference in performance between the matching and mismatching conditions was explained by representational dissociations occurring at the first stages of partial awareness of the stimuli (i.e., when the talking face started emerging into dominance, rather than during effective suppression).
If this were the case, faster RTs in one condition, compared with the other, could be attributed to differences in recognition speeds, motor responses, or response criteria (Kouider & Dehaene, 2007). Thus, in Experiment 3, we did not use interocular suppression; instead, the visual stimuli were digitally edited, and a gradual transition between the visual noise and the talking face (mimicking perceptions in Experiment 2) was presented.

Method

Participants
Experiment 1 had 16 participants (13 females, 3 males; mean age = 19.8 years), Experiment 2 had 37 participants (29 females, 8 males; mean age = 19.6 years), and Experiment 3 had 37 participants (28 females, 9 males; mean age = 20 years). All subjects were native speakers of English and reported having normal vision and no speech or hearing difficulties. The project was approved by the local ethics committee. Participants received monetary compensation.

Apparatus and data acquisition
The study was conducted in a dark sound booth. Stimuli were presented in stereo using a head-mounted display (HMD) with integrated headphones (nVisor SX60, NVIS, Reston, VA). The HMD contained two independent LCD screens, one for each eye, with a simultaneous refresh rate of 60 Hz. To calibrate the two displays, we measured the voltage drop across a photo resistor taped alternately onto each eyepiece; 16 levels of luminance equally spaced across the luminance range were individually adjusted until the voltage drops matched. The same photo resistor was used for all measurements, and the results were confirmed through several readings, with eyepieces alternated between measurements. The HMD was attached to a PC (Intel Core, Santa Clara, CA) running custom software that presented the stimuli synchronously to the two eyes. Video was controlled by a GeForce 8400 GS graphic card (NVIDIA, Santa Clara, CA). Screen tearing was avoided by rendering frames for both eyes to a back buffer and then presenting them upon screen refresh (i.e., page flipping). Responses were obtained with a response box connected to the PC. RTs (time from stimulus onset to when participants indicated they saw the face) were independently computed for each trial. Auditory speech stimuli were delivered through the headphones at an intensity of 73 dB(A) SPL.

Stimuli
The suppressors used in Experiments 1 and 2 were colorful, dynamic noise masks (i.e., overlapping rectangles, each measuring 0.83° × 0.83°; see Note 1) presented at a frequency of approximately 30 Hz. For all three experiments, audiovisual stimuli were prepared from digital video recordings of a female speaker (showing the lower part of the face) articulating a set of 84 English sentences. The recorded utterances were edited to last 6 s using Adobe Premiere 6.0 software. The videos (720 × 576 pixels at 29.97 frames/s; played using the Cinepack codec, Microsoft, Redmond, WA) were then converted to gray scale.
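As an illustration of this kind of dynamic noise suppressor, the short Python sketch below (not the stimulus code used in the study; the frame size matches the 720 × 576-pixel videos, but the rectangle sizes and count are assumptions chosen only for the example) builds a sequence of Mondrian-style mask frames from random overlapping colored rectangles.

    import numpy as np

    def make_cfs_mask_frame(width=720, height=576, n_rects=200, rng=None):
        """One Mondrian-style mask frame: random overlapping colored rectangles."""
        rng = np.random.default_rng() if rng is None else rng
        frame = np.zeros((height, width, 3), dtype=np.uint8)
        for _ in range(n_rects):
            w = int(rng.integers(10, 60))   # rectangle sizes here are illustrative values
            h = int(rng.integers(10, 60))
            x = int(rng.integers(0, width - w))
            y = int(rng.integers(0, height - h))
            frame[y:y + h, x:x + w] = rng.integers(0, 256, size=3)  # random RGB fill
        return frame

    # A 6-s trial with mask updates at roughly 30 Hz needs about 180 distinct frames.
    mask_frames = [make_cfs_mask_frame() for _ in range(180)]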
To ensure effective suppression of the face at the beginning of each trial, we reduced the contrast of the face by 84% (see Note 2), centered on the middle value of the luminance space. The contrast ramping during the trials was accomplished by computing the luminance of each pixel in each frame and scaling it by the contrast adjustment percentage (0.13%) away from the center of the luminance space. Therefore, in each frame, the contrast increased by 0.13%, so that at the end of the trial, the contrast of the face was 60% relative to the contrast of the original image.

The suppressor was first presented at full contrast; after the initial 1,000 ms, the contrast was decreased linearly by −0.6% per frame, so that by the end of the trial, the contrast had been reduced by 90% (see Note 3). The eye viewing the face and the eye viewing the suppressor were counterbalanced across trials and randomly intermixed throughout the experiment. Both the image of the talking face and the suppressor were circumscribed by a black rectangular border (19° × 16°).
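Read literally, the ramping schedule above (see also Notes 2 and 3) amounts to a linear per-frame change in contrast around the middle of the luminance range. The Python sketch below is one possible reading of that schedule, not the authors' implementation; the 8-bit midpoint of 128 and the hold-then-ramp bookkeeping are assumptions made for the example.

    import numpy as np

    FPS = 29.97                       # video frame rate reported in the Method section
    N_FRAMES = int(round(FPS * 6.0))  # 6-s trials

    def face_contrast_percent(i):
        """Face starts at 16% of its original contrast (84% reduction) and gains 0.13% per frame."""
        return min(100.0, 16.0 + 0.13 * i)

    def suppressor_contrast_percent(i):
        """Suppressor holds full contrast for 1,000 ms, then loses 0.6% per frame."""
        hold = int(round(FPS * 1.0))
        return 100.0 if i < hold else max(0.0, 100.0 - 0.6 * (i - hold))

    def scale_contrast(gray_frame, percent, mid=128.0):
        """Scale pixel values toward the middle of the luminance range by the given percentage."""
        out = mid + (gray_frame.astype(np.float64) - mid) * (percent / 100.0)
        return np.clip(out, 0, 255).astype(np.uint8)

    schedule = [(face_contrast_percent(i), suppressor_contrast_percent(i)) for i in range(N_FRAMES)]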
In the trials that presented matching audiovisual information (50%), the visual sentence was presented together with its original auditory recording. To equate the amount of visual information in the matching and mismatching trials, and to ensure that any difference between conditions was not explained by low-level stimulus differences (such as motion rate or local contrast differences), we used the same visual sentences in both conditions. The mismatching stimuli (50% of trials) were created by taking an original audiovisual stimulus and replacing the auditory sentence with another sentence from the pool of 84 sentences. Each auditory sentence was presented only once to each participant.

For the catch trials in Experiments 2 and 3, we created a solid-gray video frame by computing the average luminance of the individual pixels that made up the face stimulus; the frame was presented instead of the face. Catch trials (22% of the trials in Experiments 2 and 3, 16 sentences) were randomly intermixed with genuine trials.

Experiment 3 was designed to simulate the visual stimuli in Experiment 2 but did not involve interocular suppression. Instead, the video recordings of talking faces used in Experiment 2 (including the contrast ramping) were digitally blended into the visual noise mask using the cross-dissolve transition function (see Note 4) of Adobe Premiere 6.0, and the output displays were presented to both eyes (thus, on each trial, participants saw a smooth transition between the visual noise and the face). This condition has been used in previous binocular-rivalry studies to control for partial awareness (Jiang et al., 2007; Mudrik et al., 2011), but it does not completely match the perceptual experience of rivalry in every aspect. Thus, the absolute magnitude of the RT is not expected to be the same in this condition as in the binocular-rivalry condition. Jiang et al. adjusted the contrast's rate of change for their control condition to bring the RT in that condition artificially in line with the RT for their binocular-rivalry condition.
Because absolute RT was not our concern, we did not make such adjustments. Partial awareness was possible, in principle, with the superimposed stimuli in this control condition. Thus, Experiment 3 allowed us to rule out partial awareness as an explanation for our results (see Note 5).
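The control condition's blend is described in Note 4 as a per-pixel linear cross-dissolve. A minimal Python sketch of that general operation follows; it illustrates the technique, not the Adobe Premiere transition the authors actually used.

    import numpy as np

    def cross_dissolve(noise_frames, face_frames):
        """Re-weight each pixel linearly from 100% noise (first frame) to 100% face (last frame)."""
        assert len(noise_frames) == len(face_frames)
        n = len(noise_frames)
        blended = []
        for i, (noise, face) in enumerate(zip(noise_frames, face_frames)):
            w = i / max(n - 1, 1)            # face weight grows linearly from 0 to 1
            mix = (1.0 - w) * noise.astype(np.float64) + w * face.astype(np.float64)
            blended.append(np.clip(mix, 0, 255).astype(np.uint8))
        return blended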
If congruency effects were the result of processing occurring after the visual stimuli had already begun to overcome suppression, similar differences in RTs between the matching and mismatching trials would be expected in the CFS experiments (Experiments 1 and 2) and the control experiment (Experiment 3). Any effects occurring at the first stages of partial awareness of the face were expected to affect RTs equally in all experiments. Finding differences between the matching and mismatching trials only in the CFS experiments would allow us to reject the partial-awareness interpretation.

Procedure
Before we started the experimental sessions, we adjusted the HMD individually for each participant to account for interocular differences and to ensure correct visibility of both screens. Participants were instructed to close one eye at a time while reading a short sentence and to verbally report any difference in visual clarity between their two eyes. After this calibration phase, participants first completed eight practice trials (four matching, four mismatching) to familiarize themselves with the experimental paradigm. None of the auditory or visual sentences presented in the practice trials were used in the experimental trials. Effective suppression was assessed after the practice trials by participants' subjective reports and was confirmed later with RT data. Participants who consistently saw the face at the beginning of the trials were excluded from analyses.

Figure 1 shows a schematic representation of the audiovisual stimuli used in the three experiments. Participants were instructed to maintain stable fixation at the center of the screen throughout the trials (no fixation point was presented) and to press a button on the response box as soon as they saw any part of the face. They were also instructed to ignore the auditory information. The recorded RT served as an index of the time required to bring the face into conscious perception. Participants were instructed to not press the button if they did not see a face (catch trials). After each trial, the experimenter triggered the onset of the following trial by pressing a key on the computer keyboard.

[Figure 1: 6-s stimulus timelines for the video stream in the CFS (Experiments 1 and 2) and control (Experiment 3) conditions, and for the matching and mismatching audio streams. Example matching sentence: "More people would ride bikes if they didn't have to worry about weather, traffic, and hills." Example mismatching sentence: "She hadn't been concerned about her studies before, but she is now that the final exam is getting close."]

Fig. 1. Illustration of the stimulus sequence in the three experiments. Experiments 1 and 2 used continuous flash suppression (CFS): Participants were presented with a dynamic, colorful noise mask to one eye and a video of a talking face to the other eye; the contrast of the mask decreased across frames, while the contrast of the face increased. In Experiment 3, a control experiment, the visual stimuli from Experiment 2 were digitally blended into the visual noise mask, and the output display was presented to both eyes, so participants saw a smooth transition between the visual noise and the face. Participants were instructed to press a button as soon as any part of the face became visible. In matching trials, the visual sentence stimulus was accompanied by the auditory recording of the same sentence. In mismatching trials, the visual stimulus was accompanied by the auditory recording of another sentence, selected randomly.
Experiment 1 included 56 genuine trials (no catch trials), divided into two blocks of 28 trials; Experiments 2 and 3 each included 56 genuine trials plus 16 catch trials, divided into two blocks of 36 trials.

Results
Figure 2 summarizes overall RTs as a function of stimulus type (matching, mismatching) in the three experiments. RTs longer than 6 s (0.02%, 0.008%, and 0.002% of trials in Experiments 1, 2, and 3, respectively) were excluded from further analyses. In Experiment 1, a two-tailed paired t test revealed a significant effect of stimulus type on RT, t(15) = −2.83, p = .01, d = −0.708; average RT was shorter in the matching condition (M = 2,629 ms), compared with the mismatching condition (M = 2,764 ms; see Note 6). Experiment 2 replicated this effect: Even though RTs were longer overall than in Experiment 1 (possibly because participants performed an extraperceptual visual verification process to avoid false alarms in catch trials), there was still a reliable difference between the matching (M = 2,898 ms) and mismatching (M = 2,988 ms) conditions, t(36) = −2.10, p < .05, d = 2.12. The fact that congruency between auditory and visual information modulated RTs suggests that audiovisual correspondence is detected even when the visual information is suppressed and therefore preconsciously processed.

The analysis of participants' RTs revealed substantial individual differences across participants, with some participants detecting the face early in the sequence and others much later. Note, however, that 12 (75%) of the 16 subjects in Experiment 1 and 24 (65%) of the 37 subjects in Experiment 2 had a shorter RT for the matching stimuli, compared with the mismatching stimuli, and there was no systematic pattern relating this difference in RT to how early the recognition of the face occurred (i.e., how early or late participants detected the face did not predict the difference between their RTs on matching trials and their RTs on mismatching trials).

In contrast to the results for the CFS experiments, the results for Experiment 3, the control experiment, did not reveal differences in RTs between the matching (M = 4,550 ms) and mismatching (M = 4,547 ms) conditions, t(36) = 0.30, p = .766, d = 0.313. Furthermore, an analysis of variance with experiment (Experiment 2, Experiment 3) as a between-participants factor and stimulus type (matching, mismatching) as a within-participants factor revealed a significant Experiment × Stimulus Type interaction, F(1, 72) = 4.49, p < .05, thus reinforcing the finding that between-condition differences in performance in the CFS experiments are not likely to have been caused by partial awareness. In Experiments 2 and 3, subjects were very accurate in withholding responses in the catch trials (M = 97% and 95%, respectively), F(1, 73) = 2.76, p = .10, d = 0.386.

[Figure 2: Bar graph of mean reaction times (y-axis, Reaction Time in ms, 2,000–5,000) for matching and mismatching trials in Experiments 1, 2, and 3.]

Fig. 2. Mean reaction time in Experiments 1, 2, and 3 as a function of stimulus type (matching, mismatching). Asterisks indicate significant differences between conditions (p < .05). Error bars represent 1 SEM.
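For readers who want to run the same style of analysis on their own data, the sketch below shows the ingredients described in this section: the 6-s exclusion cutoff, a two-tailed paired t test, and a repeated-measures effect size corrected for the dependence between means in the spirit of Morris and DeShon (2002; see Note 6). The variable names and the specific correction formula used here are illustrative assumptions, not the authors' analysis script, and the example values are fabricated.

    import numpy as np
    from scipy import stats

    def congruency_test(rt_match, rt_mismatch, max_rt_ms=6000):
        """Compare per-participant mean RTs (ms) in matching vs. mismatching trials."""
        m = np.asarray(rt_match, dtype=float)
        mm = np.asarray(rt_mismatch, dtype=float)
        keep = (m <= max_rt_ms) & (mm <= max_rt_ms)   # drop values beyond the 6-s trial window
        m, mm = m[keep], mm[keep]

        t, p = stats.ttest_rel(m, mm)                 # two-tailed paired t test
        diff = m - mm
        r = np.corrcoef(m, mm)[0, 1]                  # correlation between the two conditions
        # One common dependence-corrected repeated-measures d (an assumption here,
        # standing in for Morris & DeShon's, 2002, Equation 8).
        d = diff.mean() / diff.std(ddof=1) * np.sqrt(2.0 * (1.0 - r))
        return t, p, d

    # Illustrative call with fabricated numbers:
    # t, p, d = congruency_test([2500, 2700, 2600, 2800], [2650, 2800, 2700, 2850])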
Discussion
The main finding of these three experiments is that a complex visual stimulus rendered invisible by interocular suppression can emerge into conscious perception more quickly when it is combined with a correlated acoustic signal. That is, congruency between auditory and visual information affected the time the visual stimulus remained suppressed, with matching visual speech stimuli consistently breaking through interocular suppression more quickly than mismatching stimuli did. The effect cannot be attributed to differences between the matching and mismatching visual stimuli (e.g., differences in contrast or motion speed) because the same stimuli were used in the two conditions. Furthermore, the effect is unlikely to be accounted for by partial awareness because no differences in RTs between the matching and mismatching trials were found in the control experiment (Experiment 3).

Together, our results extend to the cross-modal binding domain previous findings showing that visual grouping can occur preconsciously (e.g., Mitroff & Scholl, 2005). Our results also suggest that audiovisual interactions transpire at relatively early levels of processing, where neural information associated with the suppressed visual stimuli is still registered. These findings add to growing evidence that the detectability of a visual stimulus undergoing conscious suppression can be enhanced by a concurrent sound (Alais et al., 2010; Bolognini et al., 2005; Chen et al., 2011; Conrad et al., 2010; Frassinetti et al., 2002; Frassinetti et al., 2005; Sheth & Shimojo, 2004; van Ee et al., 2009) and show that such cross-modal gains can generalize to real-world, naturalistic stimuli that change dynamically over time.

Previous studies examining whether awareness is a critical factor for the audiovisual integration of speech stimuli have yielded mixed results (Grant & Seitz, 2000; Kim & Davis, 2004; Munhall et al., 2009; Tuomainen et al., 2005), possibly because of differences in tasks (e.g., identification of phonetic content vs. detection of the presence of speech).
Whereas studies that required phonetic categorization of the speech stimuli (Munhall et al., 2009; Tuomainen et al., 2005) have pointed to a crucial role of consciousness in audiovisual integration, studies that involved detection tasks (including the present study) have provided evidence for preconscious binding of correlated cross-modal information.

These differing results are not necessarily incompatible, however. The auditory and visual sensory systems interact at multiple levels of processing during speech perception (see Eskelund, Tuomainen, & Andersen, 2011; Hertrich, Mathiak, Lutzenberger, Menning, & Ackermann, 2007; Klucharev et al., 2003). It is possible, therefore, that the encoding of phonetic information requires integration over a longer time scale than the initial interaction of vision and audition requires (Munhall et al., 2009). Indeed, electroencephalography studies have shown that the earliest interactions between the auditory and visual properties of an audiovisual stimulus occur approximately 85 ms after stimulus onset, whereas the first phonetic audiovisual interactions are observed 155 ms after onset of an auditory stimulus (Klucharev et al., 2003). Further evidence for the multistage account of audiovisual speech integration has come from studies showing that brain areas responsive to the processing of cross-modal coincidence of physical events do not overlap with those involved in cross-modality fusion (Miller & D'Esposito, 2005). Still more evidence has come from studies showing that the stage underlying the audiovisual detection advantage is not specific to speech, whereas the stage in which the phonetic information of an audiovisual stimulus is encoded (e.g., the McGurk illusion) is specific to speech (Eskelund et al., 2011). Therefore, we believe it is possible that early audiovisual interactions (e.g., early detection of audiovisual spatiotemporal correspondences) occur prior to conscious visual awareness, but that audiovisual interactions affect phonetic categorization only after the conscious processing of sensory signals (see Munhall et al., 2009, and Tuomainen et al., 2005).

The observed detection benefits for audiovisual stimuli could be derived from structural coupling between auditory and visual events resulting from the inherent relationship between acoustic speech sounds and their visible generators (Fowler, 2004). That is, in addition to matching in phonological content, the concurrent streams of auditory and visual speech information are usually invariant in their onsets and offsets, rhythmic patterning, duration, intensity, and overall amplitude contour (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009). Mouth movements (e.g., peak opening and closing) correlate with peaks in the acoustic waveform, and the visual kinematics correspond with the timing of changes in the auditory envelope in great detail (see Yehia, Rubin, & Vatikiotis-Bateson, 1998). This detailed correspondence appears to be important for audiovisual speech integration.
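As a concrete illustration of what registering this kind of comodulation could involve computationally (an editorial sketch of the general idea, not an analysis reported in the article), one can correlate a frame-by-frame measure of mouth aperture with the acoustic amplitude envelope resampled to the video frame rate:

    import numpy as np

    def amplitude_envelope(audio, sr, fps=29.97):
        """Mean absolute amplitude of the audio signal within each video frame."""
        spf = int(round(sr / fps))                    # audio samples per video frame
        n_frames = len(audio) // spf
        chunks = np.abs(np.asarray(audio, dtype=float)[:n_frames * spf])
        return chunks.reshape(n_frames, spf).mean(axis=1)

    def audiovisual_comodulation(mouth_aperture, audio, sr, fps=29.97):
        """Pearson correlation between per-frame mouth aperture and the acoustic envelope."""
        env = amplitude_envelope(audio, sr, fps)
        ap = np.asarray(mouth_aperture, dtype=float)
        n = min(len(env), len(ap))                    # align the two streams in length
        return np.corrcoef(ap[:n], env[:n])[0, 1]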
Although in our study the visual stimuli were identical in the matching and mismatching audiovisual conditions, variations in the timing of changes in the mouth aperture and in the onsets and offsets of movement were temporally related to the patterning of the acoustical energy in the matching condition but not in the mismatching condition. It is plausible, therefore, that such comodulation of the auditory and visual signals in the matching condition was registered by systems responsible for establishing cross-modal correspondence (Bernstein, Auer, & Takayanagi, 2004; Eskelund et al., 2011; Grant & Seitz, 2000; Kim & Davis, 2004; Kim et al., 2010). Increased brain activations in response to temporally corresponding audiovisual streams have been found in a broadly distributed neural system encompassing the superior colliculus, fusiform gyrus, lateral occipital cortex, extrastriate visual cortex, and superior temporal regions (Stevenson, Altieri, Kim, Pisoni, & James, 2010). The results of the present study suggest that at least some of these areas may be activated even when visual information is preconsciously processed. The activation in these areas would subsequently amplify the neural response in visual regions responsible for the perceptual representation of the face, enhancing the signal-to-noise ratio derived from visual rivalry. Such amplification of the neuronal response in visual cortices is possibly mediated by bidirectional cortical interactions between higher-order association regions and early sensory areas (Schroeder & Foxe, 2002) or by lateral circuitry between the auditory and visual regions (Falchier, Clavagnier, Barone, & Kennedy, 2002).

Because the auditory stimuli were comprehensible to the participants in our study, perhaps they used high-level linguistic information (semantic, syntactic, lexical, or phonological) to generate probabilistic hypotheses and thus anticipate representations of the visual words in advance of their appearance (Sánchez-García et al., 2011). Participants may also have used visual imagery to predict the upcoming visual signal (see Pearson, Clifford, & Tong, 2008). Chen et al. (2011), for example, showed that when subjects were instructed to attend to an audio soundtrack, the soundtrack's semantic content influenced their visual awareness of static pictures in a binocular-rivalry paradigm.

However, we do not believe that the effects found in our studies can be explained solely by such high-level expectancies. First, participants were explicitly instructed to ignore the auditory information, and the auditory sentence in a given trial was not predictive of the type of trial being presented (22% of the trials were catch trials, and the remaining trials were equally split between the matching and mismatching conditions). However, it is possible that participants in fact did not ignore the irrelevant auditory information and used it to anticipate the upcoming mouth movements. Note, however, that if this were the case, the same expectancy effects likely would have resulted in faster responses in the matching condition than in the mismatching condition of the control experiment, but they did not.

Second, it has been shown that even in the case of a language that participants do not know, visually guided enhancements of auditory speech perception can occur when the masked auditory signals are accompanied by visual sentences (Kim & Davis, 2003); presumably, such visual enhancements of auditory stimuli are due to local correspondences between the two signals.
Further research examining whether an effect similar to the one reported here arises when the suppressed speech is in an unknown language will allow researchers to determine whether the enhancement of the visual signal occurs at a prephonological level or at higher levels of linguistic processing.

In conclusion, our findings constitute novel evidence that a temporally dynamic, complex visual speech stimulus that is only preconsciously processed can be integrated with suprathreshold auditory information, and that such integration heightens the robustness of residual signals associated with the suppressed visual speech stimulus. Although the neural mechanism that mediates the enhancement of the visual signal remains open for investigation, this behavioral finding suggests that acoustic signals modulate brain activity at an early stage of visual processing, possibly by amplifying the visual signal and thus facilitating its conscious processing (Shams et al., 2005).

Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.

Funding
This research was supported by the Natural Sciences and Engineering Research Council of Canada. A. A. was supported by a postdoctoral fellowship from Spain's Ministry of Education.

Notes
1. All visual-angle measurements in this article are based on the specifications of the product and varied slightly among participants because of individual adjustments.
2. Image manipulations were based on the luminance of the original images, and therefore only relative values (i.e., percentages) can be provided here. The specific ramping percentages we used were established on the basis of previous pilot data regarding the threshold at which the flash stimulus could reliably suppress the face at the beginning of the trial and the contrast at which the face would become visible before the end of the trial.
3. For the visual noise ramping, the same algorithm was applied to reduce the contrast over the course of the trial; the contrast-adjustment percentage (0.6%) reduced the contrast range toward the center of the luminance range. The contrast adjustment for any given frame of a visual sequence can be computed by multiplying the time (in seconds) by the frame rate (29.97 frames/s) by the step size (0.13 for the face, 0.6 for the noise) and adding or subtracting (as appropriate) the obtained value from the initial contrast adjustment of that stimulus (84% for the face, 0% for the visual noise).
4. The weight of each pixel was gradually changed from 100% of its weight in the visual noise stream (initial frame) to 100% of its weight in the talking-face stream (final frame).
5. This interpretation, however, rests on the assumption that the time window over which the partial-awareness effects would be observed (the near-threshold stimulation) was similar in the control and the CFS conditions. Note, however, that this may not necessarily have been the case. The rate of transition between dominance of the flash stimulus and the face could have been shorter in the control condition because of the very straightforward changes in stimulus contrast that occurred; in comparison, the transitions in the CFS condition were much more variable.
6. All d values were corrected for dependence between means, using Morris and DeShon's (2002) Equation 8.

References
Alais, D., van Boxtel, J. J., Parker, A., & van Ee, R. (2010). Attending to auditory signals slows visual alternations in binocular rivalry. Vision Research, 50, 929–935.
Arrighi, R., Marini, F., & Burr, D. (2009). Meaningful auditory information enhances perception of visual biological motion. Journal of Vision, 9(4), Article 25. Retrieved from http://www.journalofvision.org/content/9/4/25.full
Bernstein, L. E., Auer, E. T., Jr., & Takayanagi, S. (2004). Auditory speech detection in noise is enhanced by lipreading. Speech Communication, 44, 5–18.
Bertelson, P., Francesco, P., Ladavas, E., Vroomen, J., & de Gelder, B. (2000). Ventriloquism in patients with unilateral visual neglect. Neuropsychologia, 38, 1634–1642.
Bolognini, N., Frassinetti, F., Serino, A., & Ladavas, E. (2005). "Acoustical vision" of below threshold stimuli: Interaction among spatially converging audiovisual inputs. Experimental Brain Research, 160, 273–282.
Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5, e1000436. Retrieved from http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000436
Chen, Y.-C., Yeh, S.-L., & Spence, C. (2011). Crossmodal constraints on human perceptual awareness: Auditory semantic modulation of binocular rivalry. Frontiers in Psychology, 2, 1–12. Retrieved from http://www.frontiersin.org/Perception_Science/10.3389/fpsyg.2011.00212/full
Conrad, V., Bartels, A., Kleiner, M., & Noppeney, U. (2010). Audiovisual interactions in binocular rivalry. Journal of Vision, 10(10), Article 27. Retrieved from http://www.journalofvision.org/content/10/10/27/F1.expansion.html
Driver, J., & Noesselt, T. (2008). Multisensory interplay reveals crossmodal influences on "sensory-specific" brain regions, neural responses, and judgments. Neuron, 57, 11–23.
Dufour, A., Touzalin, P., Moessinger, M., Brochard, R., & Després, O. (2008). Visual motion disambiguation by a subliminal sound. Consciousness and Cognition, 17, 790–797.
Eskelund, K., Tuomainen, J., & Andersen, T. S. (2011). Multistage audiovisual integration of speech: Dissociating identification and detection. Experimental Brain Research, 208, 447–457.
Falchier, A., Clavagnier, S., Barone, P., & Kennedy, H. (2002). Anatomical evidence of multimodal integration in primate striate cortex. Journal of Neuroscience, 22, 5749–5759.
Fowler, C. A. (2004). Speech as a supramodal or amodal phenomenon. In G. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processes (pp. 189–202). Cambridge, MA: MIT Press.
Frassinetti, F., Bolognini, N., Bottari, D., Bonora, A., & Ladavas, E. (2005). Audio-visual integration in patients with visual deficit. Journal of Cognitive Neuroscience, 17, 1442–1452.
Frassinetti, F., Bolognini, N., & Ladavas, E. (2002). Enhancement of visual perception by crossmodal visuo-auditory interaction. Experimental Brain Research, 147, 332–343.
Grant, K. W., & Seitz, P. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197–1208.
Hertrich, I., Mathiak, K., Lutzenberger, W., Menning, H., & Ackermann, H. (2007). Sequential audiovisual interactions during speech perception: A whole-head MEG study. Neuropsychologia, 45, 1342–1354.
Jiang, Y., Costello, P., & He, S. (2007). Processing of invisible stimuli: Advantage of upright faces and recognizable words in overcoming interocular suppression. Psychological Science, 18, 349–355.
Jiang, Y., & He, S. (2006). Cortical responses to invisible faces: Dissociating subsystems for facial-information processing. Current Biology, 16, 2023–2029.
Kim, J., & Davis, C. (2003). Hearing foreign voices: Does knowing what is said affect masked visual speech detection? Perception, 32, 111–120.
Kim, J., & Davis, C. (2004). Investigating the audio-visual speech detection advantage. Speech Communication, 44, 19–30.
Kim, J., Kroos, C., & Davis, C. (2010). Hearing a point-light talker: An auditory influence on a visual detection task. Perception, 39, 407–416.
Klucharev, V., Möttönen, R., & Sams, M. (2003). Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception. Cognitive Brain Research, 18, 65–75.
Kouider, S., & Dehaene, S. (2007). Levels of processing during non-conscious perception: A critical review of visual masking. Philosophical Transactions of the Royal Society B: Biological Sciences, 362, 857–875.
Leo, F., Bolognini, N., Passamonti, C., Stein, B. E., & Ladavas, E. (2008). Cross-modal localization in hemianopia: New insights on multisensory integration. Brain, 131, 855–865.
McDonald, J. J., & Ward, L. M. (2000). Involuntary listening aids seeing: Evidence from human electrophysiology. Psychological Science, 11, 167–171.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 265, 746–748.
Miller, L. M., & D'Esposito, M. (2005). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. Journal of Neuroscience, 25, 5884–5893.
Mitroff, S. R., & Scholl, B. J. (2005). Forming and updating object representations without awareness: Evidence from motion-induced blindness. Vision Research, 45, 961–967.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105–125.
Mudrik, L., Breska, A., Lamy, D., & Deouell, L. (2011). Integration without awareness: Expanding the limits of unconscious processing. Psychological Science, 22, 764–770.
Munhall, K., ten Hove, M., Brammer, M., & Paré, M. (2009). Audiovisual integration of speech in a bistable illusion. Current Biology, 19, 735–739.
Munhall, K. G., & Vatikiotis-Bateson, E. (2004). Spatial and temporal constraints on audiovisual speech perception. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processing (pp. 177–188). Cambridge, MA: MIT Press.
Noesselt, T., Bergmann, D., Hake, M., Heinze, H. J., & Fendrich, R. (2008). Sound increases the saliency of visual events. Brain Research, 1220, 157–163.
Pearson, J., Clifford, C. W., & Tong, F. (2008). The functional impact of mental imagery on conscious perception. Current Biology, 18, 982–986.
Rubin, E. (1915). Synsoplevede Figurer [Visual figures]. Copenhagen, Denmark: Gyldendal.
Sánchez-García, C., Alsius, A., Enns, J. T., & Soto-Faraco, S. (2011). Cross-modal prediction in speech perception. PLoS ONE, 6(10), e25198. Retrieved from http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0025198
Schroeder, C., & Foxe, J. (2002). The timing and laminar profile of converging inputs to multisensory areas of the macaque neocortex. Cognitive Brain Research, 14, 187–198.
Schwartz, J. L., Grimault, N., Hupé, J. M., Moore, B. C., & Pressnitzer, D. (2012). Multistability in perception: Binding sensory modalities, an overview. Philosophical Transactions of the Royal Society B: Biological Sciences, 367, 896–905.
Sekuler, R., Sekuler, A. B., & Lau, R. (1997). Sound alters visual motion perception. Nature, 385, 308. doi:10.1038/385308a0
Shams, L., Iwaki, S., Chawala, A., & Bhattacharya, J. (2005). Early modulation of visual cortex by sound: An MEG study. Neuroscience Letters, 378, 76–81.
Sheth, B. R., & Shimojo, S. (2004). Sound-aided recovery from and persistence against visual filling-in. Vision Research, 44, 1907–1917.
Stein, B. E., London, N., Wilkinson, L., & Price, D. (1996). Enhancement of perceived visual intensity by auditory stimuli: A psychophysical analysis. Journal of Cognitive Neuroscience, 8, 497–506.
Stevenson, R. A., Altieri, N. A., Kim, S., Pisoni, D. B., & James, T. W. (2010). Neural processing of asynchronous audiovisual speech perception. NeuroImage, 49, 3308–3318.
Tsuchiya, N., & Koch, C. (2005). Continuous flash suppression reduces negative afterimages. Nature Neuroscience, 8, 1096–1101.
Tuomainen, J., Andersen, T., Tiippana, K., & Sams, M. (2005). Audio-visual speech perception is special. Cognition, 96, B13–B22.
van Ee, R., van Boxtel, J. J., Parker, A. L., & Alais, D. (2009). Multisensory congruency as a mechanism for attentional control over perceptual selection. Journal of Neuroscience, 29, 11641–11649.
Vroomen, J., & de Gelder, B. (2000). Sound enhances visual perception: Cross-modal effects of auditory organization on vision. Journal of Experimental Psychology: Human Perception and Performance, 26, 1583–1590.
Watkins, S., Shams, L., Tanaka, S., Haynes, J.-D., & Rees, G. (2006). Sound alters activity in human V1 in association with illusory visual perception. NeuroImage, 31, 1247–1256.
Yamada, Y., & Kawabe, T. (2012). Illusory line motion and transformational apparent motion during continuous flash suppression. Japanese Psychological Research, 54, 348–359.
Yang, E., Zald, D. H., & Blake, R. (2007). Fearful expressions gain preferential access to awareness during continuous flash suppression. Emotion, 7, 882–886.
Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behaviour. Speech Communication, 26, 23–43.