Creating 2.5D visual sound for an immersive audio experience

June 11, 2019

What the research is:

A deep convolutional neural network (CNN) that converts single-channel audio to binaural audio by leveraging video. We introduce the problem of creating immersive binaural audio from video, and we propose a solution that outputs what we call 2.5D visual sound. Visual frames reveal significant spatial cues that are strongly linked to the accompanying single-channel audio, even though the audio itself lacks those cues. Our multimodal approach recovers this link between the visual and audio streams from unlabeled video. We also present FAIR-Play, a dataset of videos with binaural audio that is the first of its kind, to facilitate research in both the audio and vision communities.

How it works:

Given unlabeled video as training data, we devise a Mono2Binaural deep CNN to convert single-channel audio to binaural audio by injecting the spatial cues embedded in the visual frames. In the example below, we observe from the video frame that a man is playing the piano on the left and a man is playing the cello on the right. Although we can’t sense the locations of the sound sources by listening to the mono recording, we can nonetheless infer from the visual frames what we would hear if we were in the scene ourselves. Our approach makes use of this intuition.

We infer 2.5D visual sound by injecting the spatial information contained in the video frames accompanying a typical monaural audio stream.
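To make the mono-versus-binaural distinction concrete, here is a minimal sketch in Python. It assumes a sum/difference formulation: the mono signal is the sum of the left and right channels, and the missing spatial cue is their difference. This post does not spell out that formulation, so treat it as an illustrative assumption rather than a description of the actual network. The function names, the two sine-wave "instruments," and the 0.8/0.2 gains are all hypothetical.

```python
import numpy as np

def encode_mono_and_difference(left, right):
    """Collapse a binaural pair into a mono mix plus a left-right difference.

    Assumed formulation for illustration: mono = L + R, diff = L - R.
    """
    mono = left + right   # what a single-channel recording would carry
    diff = left - right   # the spatial cue a model would have to infer from video
    return mono, diff

def decode_binaural(mono, diff):
    """Invert the encoding: recover left and right from mono and difference."""
    left = (mono + diff) / 2.0
    right = (mono - diff) / 2.0
    return left, right

# Toy scene (hypothetical values): a "piano" tone that is louder in the left
# ear and a "cello" tone that is louder in the right ear.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
piano = np.sin(2 * np.pi * 220.0 * t)
cello = np.sin(2 * np.pi * 110.0 * t)
left = 0.8 * piano + 0.2 * cello
right = 0.2 * piano + 0.8 * cello

mono, diff = encode_mono_and_difference(left, right)

# With the true difference signal, both channels are recovered exactly.
rec_left, rec_right = decode_binaural(mono, diff)

# With no spatial information (difference of zero), both ears hear the same thing.
flat_left, flat_right = decode_binaural(mono, np.zeros_like(mono))
```

Under this assumption, the mono signal alone cannot say which instrument sits on which side. Any left/right asymmetry has to come from the difference signal, which is the part that the visual frames would need to supply.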

During training, we mix the binaural audio tracks for a pair of videos to generate a mixed audio input. The network learns to separate the sound for each video conditioned on its visual frames. This video contains examples of binaural audio and our results. We also perform audio-visual source separation on predicted binaural audio and show that it provides a useful self-supervised representation for the separation task.
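The mixing step described above can be sketched as follows. This is a hedged illustration, not the training code: it assumes each clip is a NumPy array of shape (2, num_samples), and it uses random noise in place of real recordings. The helper names `mix_pair` and `ratio_masks` are hypothetical. The ratio-mask target is one common choice for separation models and is not confirmed by this post.

```python
import numpy as np

def mix_pair(binaural_a, binaural_b):
    """Sum two binaural clips, each of shape (2, num_samples), into one mixture."""
    a = np.asarray(binaural_a, dtype=np.float64)
    b = np.asarray(binaural_b, dtype=np.float64)
    if a.ndim != 2 or a.shape[0] != 2 or a.shape != b.shape:
        raise ValueError("expected two clips of identical shape (2, num_samples)")
    return a + b

def ratio_masks(binaural_a, binaural_b, eps=1e-8):
    """Per-frequency magnitude ratio masks for each source (assumed target).

    Uses a single FFT over the whole clip to keep the sketch short; a real
    system would more likely work on short-time spectrogram frames.
    """
    mag_a = np.abs(np.fft.rfft(binaural_a, axis=-1))
    mag_b = np.abs(np.fft.rfft(binaural_b, axis=-1))
    total = mag_a + mag_b + eps  # eps guards against division by zero
    return mag_a / total, mag_b / total

# Stand-in data: random noise in place of two real binaural recordings.
rng = np.random.default_rng(0)
clip_a = rng.standard_normal((2, 4096))
clip_b = rng.standard_normal((2, 4096))

mixed = mix_pair(clip_a, clip_b)              # network input
mask_a, mask_b = ratio_masks(clip_a, clip_b)  # one possible supervision signal
```

Because the mixture is built from clips whose individual tracks are known, the supervision comes for free from the unlabeled videos themselves. In this sketch the network would receive `mixed` together with the frames of one video and be asked to recover that video's share of the audio.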

Why it matters:

Deepening our understanding of how to build multimodal perception is important for artificial intelligence to capture the richness of real-world sensory environments. This research conveys the sensation of actually being in the scene of the video, bridging the gap between the audio and visual experiences. It could be useful for applications that enhance sensory experience, such as hearing aid design and augmented or virtual reality.