When listening to someone in a noisy environment, such as a cocktail party, we can understand the speaker more easily if we can also see his or her face. Movements of the lips and tongue convey additional information that helps the listener’s brain separate out syllables, words and sentences.

But exactly where in the brain this effect occurs and how it works remain unclear.

To investigate, Bruno L Giordano, of the Institute of Neuroscience and Psychology, University of Glasgow, and colleagues scanned the brains of healthy volunteers with magnetoencephalography (MEG) as they watched clips of people speaking. The clarity of the speech varied between clips.

Word Recognition

In some of the clips, the speaker's lip movements matched the speech, whereas in others the lip movements were nonsense babble. As expected, the volunteers performed better on a word recognition task when the speech was clear and when the lip movements agreed with the spoken dialogue.

[caption id=“attachment_91172” align=“aligncenter” width=“680”]Experimental paradigm and analysis.
(A) Stimuli consisted of 8 continuous 6 min long audio-visual speech samples. For each condition we extracted the acoustic speech envelope as well as the temporal trajectory of the lip contour (video frames, top right: magnification of lip opening and contour).
(B) The experimental design comprised eight conditions, defined by the factorial combination of four levels of speech-to-background signal-to-noise ratio (SNR = 2, 4, 6, and 8 dB) and two levels of visual informativeness (VI: Visual context Informative: video showing the narrator in sync with speech; VN: Visual context Not informative: video showing the narrator producing babble speech). Experimental conditions lasted 1 (SNR) or 3 (VIVN) minutes, and were presented in pseudo-randomized order.
(C) Analyses were carried out on band-pass filtered speech envelope and MEG signals. The MEG data were source-projected onto a grey-matter grid. One analysis quantified speech entrainment, i.e. the mutual information (MI) between the MEG data and the acoustic speech envelope (speech MI), as well as between the MEG and the lip contour (lip MI), and the extent to which these were modulated by the experimental conditions. A second analysis quantified directed functional connectivity (DI) between seeds and the extent to which this was modulated by the experimental conditions. A final analysis assessed the correlation of either MI or DI with word-recognition performance. Relevant variables in deposited data.[/caption]
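
The speech entrainment measure in panel C is a mutual information (MI) value between two time series: it is high when knowing one signal reduces uncertainty about the other. As a rough illustration only, the Python sketch below estimates MI with a simple equal-count binning approach on synthetic data. The estimator, the function names (`equal_count_bins`, `mutual_information`), the bin count, and the simulated signals are all assumptions made for this example and are not taken from the study, which may well have used a different estimator.

```python
import math
import random
from collections import Counter


def equal_count_bins(values, n_bins):
    """Label each sample with a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


def mutual_information(x, y, n_bins=4):
    """Plug-in (histogram) estimate of mutual information in bits."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    n = len(x)
    bx = equal_count_bins(x, n_bins)
    by = equal_count_bins(y, n_bins)
    joint = Counter(zip(bx, by))   # counts of joint bin occupancy
    px = Counter(bx)               # marginal counts for x
    py = Counter(by)               # marginal counts for y
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi


# Synthetic stand-ins (hypothetical, not real recordings):
# a slow 4 Hz "speech envelope" sampled at 100 Hz for 60 seconds.
rng = random.Random(0)
fs = 100
t = [k / fs for k in range(60 * fs)]
envelope = [math.sin(2 * math.pi * 4 * tk) for tk in t]

# A signal that follows the envelope plus noise, and one that ignores it.
tracking = [e + rng.gauss(0, 0.5) for e in envelope]
unrelated = [rng.gauss(0, 1) for _ in t]

mi_tracking = mutual_information(envelope, tracking)
mi_unrelated = mutual_information(envelope, unrelated)
print("MI with tracking signal (bits):", round(mi_tracking, 3))
print("MI with unrelated signal (bits):", round(mi_unrelated, 3))
```

In this toy setting the signal that follows the envelope yields a clearly higher MI than the unrelated one, which is the sense in which brain activity can be said to be entrained to speech. A binned estimator is used here only because it is easy to read; it is biased for short recordings and is not a substitute for the analysis described in the caption.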

Watching the video clips stimulated rhythmic activity in multiple regions of the volunteers’ brains, including areas that process sound and areas that plan movements.

Speech is itself rhythmic, and the volunteers’ brain activity synchronized with the rhythms of the speech they were listening to. Seeing the speaker’s face increased this degree of synchrony.

It also made it easier for sound-processing regions within the listeners’ brains to transfer information to one another.

Notably, only the latter effect predicted improved performance on the word recognition task. This suggests that seeing a person’s face makes it easier to understand his or her speech by boosting communication between brain regions, rather than through effects on individual areas.

Further work is required to determine where and how the brain encodes lip movements and speech sounds. The next challenge will be to identify where these two sets of information interact, and how the brain merges them together to generate the impression of specific words.

Bruno L Giordano, Robin A A Ince, Joachim Gross, Philippe G Schyns, Stefano Panzeri, Christoph Kayser. Contributions of local speech encoding and functional connectivity to audio-visual speech perception. eLife 2017;6:e24763. doi: 10.7554/eLife.24763

© 2017 eLife Sciences Publications Ltd. Republished via Creative Commons Attribution license. Top Image: israeltourism/Flickr
