Simultaneous multimodal experiences are an important facilitator of perceptual learning in humans. However, building a unified model for modality fusion in machine learning systems is challenging for several reasons: (i) differences in learning dynamics between modalities, (ii) different noise topologies, with some modality streams containing more task-relevant information than others, and (iii) specialized input representations.
The distinction between audio and visual input representations is particularly clear: many state-of-the-art audio classification systems rely on short-term Fourier analysis to produce log-mel spectrograms, which are frequently fed to CNN architectures originally developed for images. These time-frequency representations differ from images in that many acoustic objects can have energy at the same frequency, so the translation invariances of CNNs may no longer be desirable: while an acoustic object may be shifted in time, a shift in frequency could completely change its meaning.
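The log-mel pipeline mentioned above can be sketched in a few steps: window the waveform, take the magnitude of the short-term Fourier transform, pool the frequency bins with a triangular mel filterbank, and apply log compression. Below is a minimal numpy sketch; the parameter values (16 kHz sample rate, 25 ms windows, 10 ms hop, 64 mel bins) are illustrative assumptions, and production systems would typically use a library such as librosa or torchaudio instead.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=64):
    """Short-term Fourier analysis + mel filterbank + log compression (sketch)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, n_fft//2+1)

    # Triangular mel filterbank; mel(f) = 2595 * log10(1 + f / 700)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    return np.log(power @ fbank.T + 1e-10)                  # (frames, n_mels)

spec = log_mel_spectrogram(np.random.randn(16000))          # 1 s of audio
print(spec.shape)  # (98, 64)
```

The resulting 2-D time-frequency array is what gets treated "like an image" by the CNN, even though, as noted above, its two axes have very different semantics.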
While distinct spatial regions of an image correspond to different objects, the visual stream in a video is three-dimensional (one temporal and two spatial dimensions) and presents the unique difficulty of high redundancy across consecutive frames. As a result, the input representations, and thus the neural network architectures and benchmarks, have varied significantly across modalities. For simplicity, the most common multimodal fusion paradigm is an ad hoc approach that combines separate audio and visual networks through their output representations or scores, a process known as “late fusion”.
Google researchers recently published a study describing a new transformer-based model for audiovisual fusion in video. Despite their origins as NLP models, transformers have recently gained popularity as universal perceptual models due to their ability to model rich correlations between tokens while making few assumptions about their inputs (and because continuous perceptual inputs can be tokenized).
Transformers have been shown to perform competitively for image (ViT), video (ViViT) and, more recently, audio classification (AST) by breaking dense continuous signals into patches and rasterizing them into 1-D token sequences. Since these models can gracefully handle sequences of different lengths, a natural first extension would be to feed a transformer a single sequence of both visual and auditory patches, with few architectural changes.
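The patch-and-rasterize step is the same for an image frame and a spectrogram: cut the 2-D input into non-overlapping patches and flatten each patch into one token. A minimal numpy sketch, with 16×16 patches and the edge-cropping policy as illustrative assumptions:

```python
import numpy as np

def patchify(x, patch=16):
    """Split a 2-D input (image channel or spectrogram) into non-overlapping
    patches and flatten each into a 1-D token (ViT/AST-style tokenization)."""
    h, w = x.shape
    x = x[: h - h % patch, : w - w % patch]            # drop ragged edges
    rows, cols = x.shape[0] // patch, x.shape[1] // patch
    return (x.reshape(rows, patch, cols, patch)
             .transpose(0, 2, 1, 3)
             .reshape(rows * cols, patch * patch))     # (n_tokens, patch*patch)

rgb_frame = np.random.randn(224, 224)    # one image channel
spectrogram = np.random.randn(64, 98)    # (mel bins, time frames)

img_tokens = patchify(rgb_frame)         # (14 * 14, 256) = (196, 256)
aud_tokens = patchify(spectrogram)       # (4 * 6, 256)   = (24, 256)

# "Early fusion": a transformer simply consumes the concatenated sequence.
sequence = np.concatenate([img_tokens, aud_tokens])
print(sequence.shape)  # (220, 256)
```

In a real model each flattened patch would additionally pass through a learned linear projection and receive positional (and modality) embeddings before entering the transformer.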
This notion of “early fusion” allows attention to flow freely between distinct spatial and temporal regions of the image, as well as across frequency and time in the audio spectrogram. The researchers believe, however, that full pairwise attention at all layers of the model is unnecessary, since the audio and visual inputs contain rich, fine-grained information, much of which is redundant. Because attention has quadratic complexity in token sequence length, such a model also would not scale well to longer videos.
To address this, the team proposes two strategies to limit the flow of attention in the model. The first follows a typical multimodal learning pattern in which cross-modal flow is restricted to the later layers of the network, allowing the earlier layers to specialize in learning and extracting unimodal features. This is called “mid fusion”, with the fusion layer being the layer where cross-modal interactions are introduced. “Early fusion” (all layers are cross-modal) and “late fusion” (all layers are unimodal) are the two extreme cases, which the researchers compare as benchmarks.
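The early/mid/late fusion family can be captured in one forward-pass skeleton, where a single hyperparameter decides at which depth the two token streams merge. The sketch below uses a toy self-attention (no learned projections) as a stand-in for a full transformer layer; the function names and layer counts are illustrative, not the paper's implementation.

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def forward(audio, video, n_layers=12, fusion_layer=8):
    """Mid fusion: unimodal layers up to `fusion_layer`, cross-modal after.
    fusion_layer=0 recovers early fusion; fusion_layer=n_layers, late fusion."""
    for layer in range(n_layers):
        if layer < fusion_layer:                 # no cross-modal flow
            audio, video = self_attention(audio), self_attention(video)
        else:                                    # joint attention over both
            joint = self_attention(np.concatenate([audio, video]))
            audio, video = joint[: len(audio)], joint[len(audio):]
    return audio, video

a, v = forward(np.random.randn(24, 64), np.random.randn(196, 64))
print(a.shape, v.shape)  # (24, 64) (196, 64)
```

Sweeping `fusion_layer` from 0 to `n_layers` trades cross-modal capacity against compute, since the joint layers attend over the concatenated (and therefore longer) sequence.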
Restricting cross-modal attention flow between tokens within a layer is the second method (and the key contribution). To do this, the team allows attention to flow freely within a modality while forcing the model to collect and “condense” information from each modality before sharing it with the other. The basic idea is to introduce a small number of latent fusion units that function as an “attention bottleneck”, forcing cross-modal interactions within a layer to pass through them. The researchers show that this “bottlenecked” version, the multimodal bottleneck transformer (MBT), outperforms or matches its unrestricted counterpart at a lower computational cost.
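The bottleneck idea can be sketched as follows: within a layer, each modality attends only over its own tokens plus a handful of shared bottleneck tokens, and the bottleneck update is shared across modalities (here averaged, one reasonable choice; the details below are an illustrative simplification, not the paper's exact update rule).

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def bottleneck_layer(audio, video, fusion):
    """One bottlenecked layer: each modality attends over [own tokens; fusion
    tokens] only, so all cross-modal flow passes through the bottleneck."""
    a = self_attention(np.concatenate([audio, fusion]))
    v = self_attention(np.concatenate([video, fusion]))
    audio, fusion_a = a[: len(audio)], a[len(audio):]
    video, fusion_v = v[: len(video)], v[len(video):]
    return audio, video, (fusion_a + fusion_v) / 2   # shared condensed state

audio, video = np.random.randn(24, 64), np.random.randn(196, 64)
fusion = np.random.randn(4, 64)                      # B = 4 bottleneck tokens
audio, video, fusion = bottleneck_layer(audio, video, fusion)
print(audio.shape, video.shape, fusion.shape)  # (24, 64) (196, 64) (4, 64)
```

The cost saving is visible in the attention sizes: with N_a audio tokens, N_v video tokens, and B bottleneck tokens (B small), the layer computes (N_a+B)² + (N_v+B)² attention scores instead of the (N_a+N_v)² required by unrestricted joint attention.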
AudioSet, Epic-Kitchens-100, and VGGSound are the three video classification datasets the team experiments with. The backbone architecture is identical to that of ViT; specifically, the researchers use ViT-Base initialized from ImageNet-21K, but they point out that the method is backbone-agnostic and can be applied to any other transformer backbone.
Multimodal fusion beats the better-performing single-modality baseline on all datasets, illustrating the value of complementary information. The team points out that the relative value of the modalities for the classification labels varies (the audio-only baseline is relatively stronger on AudioSet and weaker on Epic-Kitchens, while the audio and visual baselines are equally strong on VGGSound). This is partly due to each dataset’s annotation procedure, and it sets VGGSound apart as a dataset well suited to fusion.
The researchers also found that audio-visual fusion improves performance on datasets that were previously treated as purely visual, such as Kinetics and Moments in Time. The team also examined per-class performance on the AudioSet dataset and found that for almost all (57 out of 60) of the top classes (ranked by overall performance), audio-visual fusion outperforms the best single-modality baseline.
Google researchers propose a new transformer architecture (MBT) for audio-visual fusion and study a variety of fusion strategies based on cross-attention over latent tokens. They propose a novel technique for limiting cross-modal attention via a small number of fusion “bottlenecks” and show that it outperforms vanilla cross-attention at reduced computational cost, achieving state-of-the-art results on a number of benchmarks. In future work, MBT will be extended to other modalities, such as text and optical flow.