The authors have declared that no competing interests exist.
Conceived and designed the experiments: JM MG SS. Performed the experiments: JM. Analyzed the data: JM SS. Contributed reagents/materials/analysis tools: JM. Wrote the paper: JM MG SS.
Auditory and visual signals generated by a single source tend to be temporally correlated, such as the synchronous sounds of footsteps and the limb movements of a walker. Continuous tracking and comparison of the dynamics of auditory-visual streams is thus useful for the perceptual binding of information arising from a common source. Although language-related mechanisms have been implicated in the tracking of speech-related auditory-visual signals (e.g., speech sounds and lip movements), it is not well understood which sensory mechanisms generally track ongoing auditory-visual synchrony for non-speech signals in a complex auditory-visual environment. To begin to address this question, we used music and visual displays that varied in the dynamics of multiple features (e.g., auditory loudness and pitch; visual luminance, color, size, motion, and organization) across multiple time scales. Auditory activity (monitored using auditory steady-state responses, ASSR) was selectively reduced in the left hemisphere when the music and dynamic visual displays were temporally misaligned. Importantly, ASSR was not affected when attentional engagement with the music was reduced, or when visual displays presented dynamics clearly dissimilar to the music. These results suggest that left-lateralized auditory mechanisms are sensitive to auditory-visual temporal alignment, but perhaps only when the dynamics of auditory and visual streams are similar. These mechanisms may contribute to correct auditory-visual binding in a busy sensory environment.
The detection, localization, and identification of a visual stimulus are facilitated by a simultaneously presented sound that is spatially coincident or normatively associated
In addition to location- and object-based interactions between coincident auditory and visual stimuli, continuous auditory and visual streams interact based on dynamic congruence. These dynamic interactions have been studied predominantly for speech perception. For example, presentation of visual lip movements facilitates the perception of congruent auditory speech (
These facilitative effects of auditory-visual dynamic congruence during speech perception indicate that auditory and visual streams are integrated in the brain, at least when they share similar dynamics. Thus, it is plausible that, even for non-speech stimuli, some general perceptual mechanisms track the temporal alignment of auditory-visual dynamics. Anecdotally, we enjoy watching dancers glide and leap in temporal alignment with music at a concert, but we are displeased when images are misaligned with the sounds due to a transmission delay while watching live footage of an unfolding event on the news. Temporal alignment of auditory-visual dynamics is also likely to provide a general cue indicating that the two streams arise from a common source. For example, temporally aligned limb movements and footsteps often indicate that both dynamic signals originate from a single walker.
The neural substrates underlying the continuous tracking of auditory-visual synchrony, outside the domain of speech perception, are not well understood. Several studies suggest that left auditory cortex is specialized for processing rapidly varying features in sounds (
Support for this idea arises from a recent study examining an electroencephalographic (EEG) correlate of auditory-visual dynamic congruence using simple periodic stimuli, with the rates of auditory and visual modulation (2.1 or 2.4 Hz) either being identical (congruent condition) or different (incongruent condition)
Note that ASSR in the left-frontal ROI is reduced only in the AV-misaligned condition (d). Bar to the right indicates scale.
In the present experiments, we made several modifications to Nozaradan
An exception is a recent magnetoencephalography (MEG) study that showed that when people watched a feature movie with its own soundtrack, the phase coherence of oscillatory activity in the delta and theta frequencies increased within and across auditory and visual areas as compared to when they watched the same movie while listening to a soundtrack from a different movie
As in Nozaradan
The visualizer display was either synchronized with the music (the
Such a difference, however, could reflect a difference in attentional engagement with the music; people may be more strongly engaged with the music in the AV-aligned condition when the visual display pleasantly matches the music than in the AV-misaligned condition when the visual display appears disjointed from the music. Some studies have demonstrated that ASSR can be increased with greater auditory engagement
Given this approach, we reasoned that if attentional engagement with the music drives ASSR, ASSR should be largest in the AV-aligned condition in which a synchronized visualizer display would increase engagement with the music. ASSR should be reduced in the AV-misaligned condition in which a misaligned visualizer display would reduce engagement with the music, but ASSR should be most strongly reduced in the control conditions in which participants actively ignore the music to perform reading tasks, especially in the dual-task control condition in which attentional resources were most diverted from the music.
In contrast, our working hypothesis is that left-lateralized auditory mechanisms monitor the temporal alignment between auditory and visual dynamics irrespective of attentional engagement. Note that auditory-visual temporal alignment is meaningful only when auditory and visual dynamics are similar. We thus reasoned that the responses of left-lateralized auditory mechanisms to music, as probed via ASSR, would not be affected by reading text in the single- and dual-task control conditions because the dynamics of word presentations in those conditions are unrelated to the dynamics of the music. In this sense, the single- and dual-task control conditions provide an auditory-visual baseline that is common in everyday experience, such as listening to music while reading e-mails on a computer screen. Because we hypothesize that left-lateralized auditory mechanisms track auditory-visual temporal alignment irrespective of attentional engagement, we predict that ASSR would be equivalent in the single- and dual-task control conditions. When processing dynamically aligned auditory-visual signals, left-lateralized auditory mechanisms might increase activity (relative to the control conditions) because temporally aligned auditory-visual signals provide consistent information about a common source. In contrast, left-lateralized auditory mechanisms should reduce activity when processing dynamically similar but temporally misaligned auditory-visual signals (based on
In summary, we hypothesize that left-lateralized auditory mechanisms, known to process rapidly varying auditory features, also process temporal alignment between auditory and visual signals when their dynamics are similar. This predicts that left-lateralized ASSR should be largest in the AV-aligned condition, equivalently large or reduced in the two control conditions, and substantially reduced in the AV-misaligned condition. However, if the conditions instead influence ASSR based on the modulation of attentional engagement with the music, left-lateralized ASSR should be most reduced in the two control conditions, especially in the dual-task control condition, in which attention was strongly diverted from the music.
The experiments and consent forms were approved by the Northwestern University Institutional Review Board (NUIRB). Written informed consent was obtained from each participant, and all investigations were conducted according to the principles expressed in the Declaration of Helsinki.
Twenty-eight (17 female) right-handed adults (18–29 years old) with normal hearing and normal or corrected-to-normal vision responded to a posting on the Northwestern University campus (Evanston, IL, USA) and received monetary compensation for their participation.
Each participant was seated in a comfortable armchair (to reduce muscle artifacts in EEG signals) at 120 cm from the display monitor (21′′, 1024×768 resolution, 60-Hz refresh rate). A six-minute MP3 recording of Beethoven’s
Each participant listened to an identical 2-minute portion of Beethoven’s
In the AV-aligned condition, changes in the luminosity of the features within the visualizer display matched changes in the intensity of the music (see below for verification). Other visual features such as color, speed, motion, and organization also dynamically changed seemingly in synchrony with the changes in auditory intensity and pitch, but these instances of auditory-visual synchronization were less apparent. We emphasize that our goal was to determine the role of left-lateralized auditory mechanisms in tracking auditory-visual alignment in complex naturalistic stimuli that included dynamics across multiple time scales in the context of multiple concurrently varying features, rather than to investigate the processing of a specific combination of features within a specific time scale as in
Participants were asked to attend to the music and the visualizer display, and to simply enjoy their listening/watching experience. They were not informed that the visualizer display was temporally aligned with the music in one condition and misaligned in the other. They were merely told that they would be watching two 2-minute instances of the same song with different visualizer displays. The stimuli (auditory and visual) were presented with a MacBook Pro laptop computer running OS 10.6.
The iTunes
The single- and dual-task control conditions were included because any reduction in ASSR in the AV-misaligned condition could potentially be due to attentional disengagement from the music when the visual display appeared misaligned with the music. In these control conditions, we presented participants with the same amplitude-modulated music as in the AV-aligned and AV-misaligned conditions, but we told participants to ignore the music and perform a reading task. Reading material consisted of the first 1,182 words of English text from the first chapter of
In the single-task control condition, participants viewed the words from the story presented in a randomized order and were asked to press a button when they saw a target word, “and.” In the dual-task control condition, in addition to asking participants to ignore the music and respond every time “and” was displayed, we presented the same set of words in the order of the original text and instructed participants to follow the story in preparation for a comprehension post-test. The comprehension test, consisting of 16 yes-no questions, was given immediately after the dual-task control condition. None of the participants had read the text before the experiment, but they all scored above chance (mean = 14/16), indicating that they made an effort to comprehend the story. Thus, in the dual-task control condition, we further reduced auditory engagement with the music by imposing a greater cognitive and working-memory load than in the single-task control condition.
Each control condition lasted about 10 minutes. Presentation software (version 11.0, Build 04.25.07,
The single- and dual-task control conditions were run before the AV-aligned and AV-misaligned conditions to avoid drawing attention to the music before it was time for participants to attend to the music. We note that there was no significant order effect for ASSR amplitudes in our ROI in either hemisphere (see below) across the four conditions (main effect of order
During each condition, EEG was continuously recorded using a 64-channel (10–20 configuration) Biosemi system with a nose reference as well as additional electrodes, one placed lateral to each eye for recording horizontal electro-oculographic (EOG) activity and one placed under the left eye for recording vertical EOG activity, including blinks. Data were sampled at 1024 Hz and bandpass filtered between 0.1 and 100 Hz. The resulting EEG data were segmented into 1-s epochs; epochs with eye blinks and muscle artifacts were manually removed based on vertical EOG activity (generally >100 µV, but adjusted for several participants as necessary), and epochs with saccades were manually removed based on horizontal EOG activity (>100 µV, but adjusted as necessary). The first 80 artifact-free epochs from each participant for each condition were transformed into CSD (Current Source Density) maps using CSDtoolbox Version 1.1 (
ASSR amplitude was computed (for each electrode and each participant) by averaging the EEG waveforms across the 80 epochs, taking a fast Fourier transform (FFT) of the average waveform (using Matlab 7.4.0; Mathworks), then extracting the amplitude of the Fourier component at 40 Hz (at 1-Hz resolution). Averaging the EEG waveforms across the 80 epochs before taking an FFT reduced any contributions from non-phase-locked responses, thus isolating the stimulus-evoked auditory neural responses. CSD-transformed EEG signals offer a conservative estimate of the locations of the underlying neural generators
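As a rough illustration of the averaging-then-FFT step, the following Python sketch computes an evoked amplitude at the driven frequency. It is not the authors' Matlab code: the function names, the single-bin DFT used in place of a full FFT, the 2/N amplitude scaling, and the synthetic epochs are all assumptions made for the example.

```python
import cmath
import math

FS_HZ = 1024     # sampling rate reported in the text
DRIVE_HZ = 40    # amplitude-modulation (driven) frequency

def fourier_coefficient(signal, freq_hz, fs_hz=FS_HZ):
    """Complex DFT coefficient of `signal` at one frequency (single-bin DFT)."""
    step = -2j * math.pi * freq_hz / fs_hz
    return sum(x * cmath.exp(step * k) for k, x in enumerate(signal))

def evoked_amplitude(epochs, freq_hz=DRIVE_HZ, fs_hz=FS_HZ):
    """Average the epochs in the time domain, then read the amplitude at freq_hz."""
    n = len(epochs[0])
    mean_wave = [sum(ep[k] for ep in epochs) / len(epochs) for k in range(n)]
    # The 2/N scaling is an assumption; it maps a unit-amplitude sinusoid to 1.0.
    return 2.0 * abs(fourier_coefficient(mean_wave, freq_hz, fs_hz)) / n

def synthetic_epoch(jitter_phase, n=FS_HZ):
    """Hypothetical 1-s epoch: phase-locked 40-Hz part (amplitude 1)
    plus a 40-Hz part with an epoch-specific phase (amplitude 2)."""
    w = 2 * math.pi * DRIVE_HZ / FS_HZ
    return [math.cos(w * k) + 2.0 * math.cos(w * k + jitter_phase)
            for k in range(n)]

# Eight epochs whose non-locked phases are spread evenly around the circle.
n_epochs = 8
epochs = [synthetic_epoch(2 * math.pi * i / n_epochs) for i in range(n_epochs)]
amp = evoked_amplitude(epochs)
```

In this toy case the non-phase-locked component cancels in the time-domain average, so `amp` comes out close to the phase-locked amplitude of 1.0 even though each individual epoch carries more 40-Hz energy.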
ASSR phase-locking was computed for each electrode and each participant by taking an FFT of the EEG waveform from each epoch, extracting the complex Fourier coefficient for the 40-Hz component, normalizing it by dividing by its amplitude, averaging these normalized complex coefficients across the 80 epochs, then taking the amplitude of the resultant complex number. This phase-locking measure is commonly referred to as inter-trial phase coherence or ITPC (
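The phase-locking computation described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: the helper names, the single-bin DFT, and the synthetic sinusoidal epochs are assumptions.

```python
import cmath
import math

FS_HZ = 1024     # sampling rate reported in the text
DRIVE_HZ = 40    # driven frequency

def fourier_coefficient(signal, freq_hz, fs_hz=FS_HZ):
    """Complex DFT coefficient of `signal` at one frequency (single-bin DFT)."""
    step = -2j * math.pi * freq_hz / fs_hz
    return sum(x * cmath.exp(step * k) for k, x in enumerate(signal))

def itpc(epochs, freq_hz=DRIVE_HZ, fs_hz=FS_HZ):
    """Inter-trial phase coherence: length of the mean unit phase vector."""
    unit_sum = 0j
    for ep in epochs:
        coeff = fourier_coefficient(ep, freq_hz, fs_hz)
        mag = abs(coeff)
        if mag == 0.0:
            raise ValueError("phase is undefined: zero amplitude at freq_hz")
        unit_sum += coeff / mag      # keep the phase, discard the amplitude
    return abs(unit_sum / len(epochs))

def sinusoid(phase, n=FS_HZ):
    """Hypothetical 1-s epoch containing only a 40-Hz sinusoid."""
    w = 2 * math.pi * DRIVE_HZ / FS_HZ
    return [math.cos(w * k + phase) for k in range(n)]

# Identical phase on every epoch versus phases spread evenly over the circle.
locked = [sinusoid(0.3) for _ in range(8)]
scattered = [sinusoid(2 * math.pi * i / 8) for i in range(8)]
itpc_locked = itpc(locked)
itpc_scattered = itpc(scattered)
```

Because each coefficient is normalized before averaging, the measure depends only on the consistency of phase across epochs: it approaches 1 for `locked` and 0 for `scattered`, regardless of response amplitude.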
Oscillatory neural activity that was not phase-locked to the amplitude modulation of the music was computed for each electrode and each participant by taking an FFT of the EEG waveform for each epoch, averaging the Fourier amplitudes across the non-driven frequencies (±10 Hz relative to the driven frequency, 40 Hz), then averaging these mean amplitudes across the 80 epochs. The choice of averaging Fourier amplitudes over ±10 Hz around the driven frequency was made somewhat arbitrarily because we had no hypothesis about what frequency range should be influenced by auditory-visual dynamic alignment. Nevertheless, using a larger (±40 Hz) or smaller (±5 Hz) frequency range did not affect the statistical analyses presented in the
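A minimal Python sketch of this non-phase-locked measure is given below, under stated assumptions: 1-s epochs (so that Fourier bins fall on integer frequencies), exclusion of the 40-Hz bin itself as our reading of "non-driven frequencies", a 2/N amplitude scaling, and synthetic data. The function names are hypothetical and this is not the authors' implementation.

```python
import cmath
import math

FS_HZ = 1024     # sampling rate reported in the text
DRIVE_HZ = 40    # driven frequency

def fourier_coefficient(signal, freq_hz, fs_hz=FS_HZ):
    """Complex DFT coefficient of `signal` at one frequency (single-bin DFT)."""
    step = -2j * math.pi * freq_hz / fs_hz
    return sum(x * cmath.exp(step * k) for k, x in enumerate(signal))

def induced_amplitude(epochs, drive_hz=DRIVE_HZ, half_width_hz=10, fs_hz=FS_HZ):
    """Mean single-epoch amplitude over the non-driven bins within
    +/- half_width_hz of drive_hz, averaged across epochs."""
    freqs = [f for f in range(drive_hz - half_width_hz,
                              drive_hz + half_width_hz + 1) if f != drive_hz]
    per_epoch = []
    for ep in epochs:
        n = len(ep)
        amps = [2.0 * abs(fourier_coefficient(ep, f, fs_hz)) / n for f in freqs]
        per_epoch.append(sum(amps) / len(amps))
    return sum(per_epoch) / len(per_epoch)

def synthetic_epoch(phase_35, n=FS_HZ):
    """Hypothetical epoch: 40-Hz drive (amplitude 5) plus a 35-Hz
    oscillation (amplitude 2) whose phase differs across epochs."""
    w40 = 2 * math.pi * DRIVE_HZ / FS_HZ
    w35 = 2 * math.pi * 35 / FS_HZ
    return [5.0 * math.cos(w40 * k) + 2.0 * math.cos(w35 * k + phase_35)
            for k in range(n)]

epochs = [synthetic_epoch(2 * math.pi * i / 4) for i in range(4)]
induced = induced_amplitude(epochs)
```

Taking amplitudes epoch by epoch, before averaging, preserves activity whose phase varies across epochs. In this toy case only the 35-Hz bin is non-zero among the 20 non-driven bins, so `induced` is about 2.0/20 = 0.1 and the 40-Hz drive does not contribute.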
Stimulus-evoked ASSR, ITPC (phase-locking), and stimulus-induced activity were averaged over the scalp sites within the lateralized regions of interest (ROIs). The ROIs were determined by generating a topographic map of ASSR (averaged across the AV-aligned and AV-misaligned conditions) and selecting the 30 most responsive electrodes, which included 15 electrodes in each hemisphere, yielding the left-frontal and right-frontal ROIs (see
ASSR amplitudes were calculated from current-source density (CSD) transformed EEG scalp potentials. (
ASSR in the left-frontal ROI (
Auditory-visual dynamic alignment had little effect on the degree of phase-locking at 40 Hz in either ROI (ITPC = 0.318 [standard error of the mean; s.e.m. = 0.027] vs. 0.299 [s.e.m. = 0.024] in the AV-aligned vs. AV-misaligned conditions,
Asking participants to ignore the music and visually monitor for the target word in the single-task control condition caused little change in ASSR relative to the AV-aligned condition (
Bar to the right indicates scale.
The requirement to comprehend and remember the story while also performing the target-word task in the dual-task control condition substantially engaged attentional resources, evidenced by the fact that the response time and accuracy for responding to the target word were significantly degraded in the dual- versus the single-task control condition (response time: 637 ms [s.e.m. = 17] vs. 573 ms [s.e.m. = 12],
The lack of influence of the single- and dual-task control conditions on ASSR reasonably rules out the possibility that the left-lateralized reduction in ASSR in the AV-misaligned condition was due to attentional disengagement from the music. Overall, ASSR from the left-frontal ROI was equivalent in the AV-aligned and the two control conditions, but was selectively reduced in the AV-misaligned condition (
We tested the hypothesis that left-lateralized auditory mechanisms, proposed by some to process rapidly varying features in sounds, might also contribute to the tracking of the dynamic alignment of auditory and visual features. To test this hypothesis in a naturalistic context, we used music (Beethoven’s
Compared with when the visualizer displays were temporally aligned with the music, ASSR from the left-frontal (but not right-frontal) ROI was reduced when the visualizers were temporally misaligned relative to the music. There were no significant changes in the degree of phase-locking (ITPC) to the 40-Hz modulation or in the ongoing (non-stimulus-locked) oscillatory neural activity, suggesting that a dynamically misaligned complex visual display reduces the amplitude of left-lateralized auditory responses to music without measurably influencing the temporal fidelity of auditory responses or non-sensory responses, including those associated with cognition, emotion, or arousal.
The control conditions revealed that consciously ignoring the music, performing a visual task, and substantially engaging attentional resources in a concurrent reading-comprehension task each failed to reduce ASSR. Some studies have reported that the level of auditory engagement modulates ASSR
The usual approach to investigating the mechanisms encoding auditory-visual dynamics has been to compare neural activity across conditions in which auditory and visual dynamics are similar but deviate subtly from synchronization
In summary, our results are consistent with the idea that left-lateralized auditory cortical mechanisms continuously track complex dynamic alignment between visual and auditory streams, but only when auditory and visual dynamics are similar, making the processing of dynamic auditory-visual alignment particularly useful for cross-modal binding in a busy sensory environment. The results are also consistent with the idea that the left auditory cortical specialization for the processing of rapidly varying features of sounds (for review, see
Histogram of the difference in the visual-luminance-vs.-auditory-intensity correlation (
Visualizer control experiments.