The Sound of Pixels

Original article was published on Deep Learning on Medium


This paper proposes PixelPlayer, a system to ground audio inside a video (frames) without manual supervision. Given an input video, PixelPlayer separates the accompanying audio into components and spatially localizes these components in the video. PixelPlayer enables us to listen to the sound originating from each pixel in the video as shown in the next Figure.

“Which pixels are making sounds?” Energy distribution of sound in pixel space. Overlaid heatmaps show the volumes from each pixel.

To train PixelPlayer’s neural networks, a dataset is needed. The authors introduce a musical-instrument video dataset for the proposed task, called the MUSIC (Multimodal Sources of Instrument Combinations) dataset. The dataset is crawled from YouTube with no manual annotation and contains 714 untrimmed videos of musical solos and duets. The next Figure shows the dataset statistics. Since the source pixels of the audio are not manually labeled in these videos, PixelPlayer is trained to learn them with a self-supervision trick.

Dataset Statistics: a) Shows the distribution of video categories. There are 565 videos of solos and 149 videos of duets. b) Shows the distribution of video durations. The average duration is about 2 minutes.

PixelPlayer consists of three main networks: (1) a video analysis network, (2) an audio analysis network, and (3) an audio synthesizer network. For each network, I will illustrate the testing setup, then highlight what differs during training. It is important to note that the inputs to the audio synthesizer network have different dimensions during testing and training.

Video Analysis Network: Given a video, three frames are sampled. A ResNet model extracts per-frame features of size T×(H/16)×(W/16)×K, where T=3 is the number of frames, H and W are the frame height and width, and K is the number of output channels. The video analysis network temporally pools the ResNet features and outputs a 3D tensor with (H/16)×(W/16)×K dimensions as shown in the next Figure. During testing, this 3D tensor is fed directly into the audio synthesizer network. During training, however, an extra spatial pooling is applied, collapsing the 3D tensor into a 1D vector with K dimensions.

The Video analysis network samples three frames from a video and generates a 3D tensor output (green cuboid). However, during training, an extra spatial pooling is applied on the 3D tensor to convert it into a 1D vector.
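As a rough sketch, the two pooling steps can be mimicked with NumPy. The shapes and the choice of max pooling here are my illustrative assumptions, not necessarily the paper’s exact values:

```python
import numpy as np

# Hypothetical per-frame ResNet features: T x (H/16) x (W/16) x K.
T, H16, W16, K = 3, 14, 14, 16            # e.g. 224x224 input frames (assumed)
frame_feats = np.random.rand(T, H16, W16, K)

# Temporal pooling over the T sampled frames -> (H/16) x (W/16) x K.
# This 3D tensor is what the synthesizer consumes at test time.
video_feats = frame_feats.max(axis=0)

# Training only: extra spatial pooling collapses the 3D tensor
# into a single K-dimensional vector.
train_feats = video_feats.max(axis=(0, 1))

print(video_feats.shape)  # (14, 14, 16)
print(train_feats.shape)  # (16,)
```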

Audio Analysis Network: Given an input audio file (1D), it is converted into a spectrogram (2D). The spectrogram is then fed into an audio U-Net that splits the sound into K components (3D) as shown in the next Figure. During testing, the input audio (S) comes from a single video. During training, however, the input audio (S) is a mixture of audio signals from different videos, producing a more complex input signal.

The Audio analysis network uses Short Time Fourier transform (STFT) to convert the 1D input audio wave into a 2D spectrogram. Then, an audio U-Net splits the spectrogram into K audio channels.
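A minimal sketch of this pre-processing with SciPy, assuming a mono waveform; the sample rate, window, and hop sizes are illustrative, not necessarily the paper’s values:

```python
import numpy as np
from scipy.signal import stft, istft

sr = 11025                                  # assumed sample rate
audio = np.random.randn(sr * 6)             # stand-in for 6 s of audio

# 1D waveform -> 2D complex spectrogram via STFT.
freqs, times, spec = stft(audio, fs=sr, nperseg=1022, noverlap=1022 - 256)
print(spec.shape)                           # (frequency bins, time frames)

# Networks like this typically operate on the (log) magnitude.
log_mag = np.log(np.abs(spec) + 1e-8)

# Inverse STFT turns a (masked) spectrogram back into a waveform;
# this is the final step of the synthesizer described below.
_, recovered = istft(spec, fs=sr, nperseg=1022, noverlap=1022 - 256)
```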

Audio Synthesizer Network: Given the video-analysis and audio-analysis networks’ outputs, the audio-synthesizer outputs a mask to be applied to the input spectrogram. The mask selects the spectral components associated with each pixel (video-analysis output). Finally, inverse STFT is applied to the masked spectrogram, corresponding to each pixel, to produce the final sound as shown in the next figure.

The audio-synthesizer network has two inputs, i.e., the outputs of the video-analysis and audio-analysis networks. The audio-synthesizer learns a spectrogram mask that assigns a sound to a particular pixel. Finally, the inverse Fourier transform generates an output audio signal per mask.
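For a single pixel, the synthesizer step can be sketched as a weighted sum of the K audio channels followed by a sigmoid, which matches the paper’s description as I understand it; the shapes below are my illustrative assumptions:

```python
import numpy as np

K, F, Tf = 16, 256, 64                      # assumed channel/spectrogram sizes
pixel_feat = np.random.rand(K)              # video feature at one pixel
audio_channels = np.random.rand(K, F, Tf)   # K spectrogram components (U-Net output)
mix_spec = np.random.rand(F, Tf)            # input (mixture) magnitude spectrogram

# Combine the K audio channels, weighted by the pixel's feature vector,
# then squash to [0, 1] with a sigmoid to obtain a per-pixel mask.
logits = np.tensordot(pixel_feat, audio_channels, axes=([0], [0]))  # (F, Tf)
mask = 1.0 / (1.0 + np.exp(-logits))

# Applying the mask selects this pixel's spectral components;
# an inverse STFT of masked_spec would yield that pixel's sound.
masked_spec = mask * mix_spec
print(masked_spec.shape)                    # (256, 64)
```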

The next figure shows the three networks and highlights the main differences between the testing and training setup. PixelPlayer is trained to separate the combined audio signal (S1 + S2) back into two independent signals (S1, S2), where S1 and S2 are the audio signals of the first and second video, respectively. By training on this self-supervised trick, PixelPlayer learns to ground audio inside a video (frames/pixels) without manual supervision.

PixelPlayer is trained to separate the combined audio signal (S1 + S2) back into two independent signals (S1, S2), where S1 and S2 are the audio signals of the first and second video, respectively.

The next figure depicts the process of mixing two audio signals into a single spectrogram and then using the binary output mask to separate the audio signal.

Qualitative results on vision-guided source separation on synthetic audio mixtures. This figure employs the training, not the testing, setup. Because the ground-truth sources are known in this mix-and-separate setup, it is also the setup used for quantitative model evaluation.
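The mix-and-separate supervision can be sketched as follows. Random waveforms stand in for real audio, and the dominant-source rule for the binary target mask follows the paper’s binary-mask formulation as I understand it:

```python
import numpy as np
from scipy.signal import stft

sr = 11025                                  # assumed sample rate
s1 = np.random.randn(sr * 2)                # stand-in for video 1's audio
s2 = np.random.randn(sr * 2)                # stand-in for video 2's audio

# Audio signals are approximately additive, so the mixture is a sum.
mix = s1 + s2

# Spectrograms of the two sources.
_, _, spec1 = stft(s1, fs=sr, nperseg=1022)
_, _, spec2 = stft(s2, fs=sr, nperseg=1022)

# Ground-truth binary mask for source 1: the time-frequency bins
# where source 1 dominates the mixture. The network's predicted mask
# is trained against this target (e.g., with a per-bin binary loss).
target_mask1 = (np.abs(spec1) >= np.abs(spec2)).astype(np.float32)
```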

PixelPlayer can answer two questions given an audio-video signal: (1) Which pixels are making sounds? and (2) What sounds do these pixels make? The first Figure in this article shows how PixelPlayer answers the first question; the next Figure shows how it answers the second.

“What sounds do these pixels make?” Clustering of sound in space. Overlaid colormap shows different audio features with different colors.

This article presents qualitative evaluations only. For a quantitative evaluation of the output audio signals, please refer to the paper. The following video presents additional vivid qualitative results together with their output audio signals.

My Comments

  • This paper requires an elementary audio-processing background (e.g., spectrograms). I am interested in unsupervised/self-supervised approaches but have limited experience working with audio, so this paper provided a smooth introduction to self-supervised learning in the joint audio-video domain.
  • The paper is a bit confusing due to the differences between the training and testing setups. The audio synthesizer network takes video features of different dimensions during training (a 1D vector) and testing (a 3D tensor)! So, I am surprised the audio synthesizer works as well as presented.
  • The aggressive use of temporal and spatial pooling on the video features seems to work because the videos have static scenes. I don’t think PixelPlayer generalizes to more complex videos. On the other hand, this 2018 paper is early work in the joint audio-video domain, and I am sure there is more advanced follow-up literature.
  • The main thing I like about this paper is the idea of mixing and separating audio files. The authors leverage the fact that audio signals are approximately additive. I wonder: is it possible to apply the mix-and-separate idea to images/videos?