Ph.D Thesis

Ph.D Student: Dov David
Subject: Multi-Modal Signal Processing on Manifolds
Department: Department of Electrical and Computer Engineering
Supervisors: Prof. Israel Cohen


Multi-modal signals, i.e., signals measured by multiple sensors of different types, often have different characteristics across the modalities, e.g., different dynamics, dimensions, and value ranges. Accordingly, the various sources of data are expressed differently across the modalities: some of them are common to the different modalities, while others appear only in specific modalities. For example, a speech signal measured by both a microphone and a video camera is considered a common source, while speech from other speakers is audio-specific. These unique characteristics are appealing in various applications, such as the analysis of audio-visual scenes, but they also raise fundamental questions related to the joint analysis of multi-modal signals. The questions we address in this thesis are: how to obtain a representation of the signal according to the common source while reducing the effect of modality-specific sources, which are considered interferences; how to process data that is available in the different modalities only in certain time intervals; and how to measure to what extent the data in the different modalities correspond (“correlate”) to each other.

We address these questions from a manifold learning perspective by designing kernel-based geometric methods. Classical kernel-based methods are typically designed for analyzing data measured by a single sensor, and they do so by learning the geometric structure of high-dimensional data. These methods provide low-dimensional representations of the data via the eigenvalue decomposition of affinity kernels that capture local relations (affinities) between samples of the signal.

In this thesis, we address the problem of data fusion via the combination of affinity kernels constructed separately for each modality. We consider a particular combination of the kernels, their product, which was previously shown to provide a representation of the data according to the common source. We introduce a new graph-theoretic interpretation of this approach, relating the kernels to their corresponding single-modal and multi-modal graphs. We analyze the relations between the connectivities of the graphs and, based on this analysis, further improve the fusion process with an improved method for constructing the affinity kernels. Then, we extend the fusion problem to a setting where the data in the different modalities is only partially available. We show how the proposed fusion approach can be extended to this setting, making it possible to obtain a joint representation of the signals according to the common source even when the multi-modal signals are available only in certain time intervals. Finally, we address the question of to what extent signals from different sensors correspond to each other, i.e., contain similar content. By revisiting the graph-theoretic analysis of the product of kernels, we devise a measure of multi-modal correspondence and show how it can be efficiently updated in an online setting. We demonstrate the proposed approaches for data fusion and for measuring multi-modal correspondence in various applications related to the analysis of complex audio-visual sound scenes. In particular, we report improved performance for different variants of the tasks of sound source activity detection and audio localization in video.
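
To illustrate the idea of fusing modalities through a product of affinity kernels, the following is a minimal sketch and not the construction used in the thesis. It assumes Gaussian affinities, row normalization, and a plain matrix product; the function names, the toy data, and the scale parameter `eps` are all hypothetical. Each toy sample carries a common value `c` seen by both modalities, plus one interference value per modality (`a` in the first, `b` in the second).

```python
import math


def gaussian_kernel(samples, eps):
    """Row-normalized Gaussian affinity kernel for one modality (assumed form)."""
    kernel = []
    for x in samples:
        row = [
            math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / eps)
            for y in samples
        ]
        total = sum(row)
        kernel.append([v / total for v in row])
    return kernel


def kernel_product(k1, k2):
    """Matrix product of two kernels: one step in modality 1, then one in modality 2."""
    n = len(k1)
    return [
        [sum(k1[i][m] * k2[m][j] for m in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: (c, a, b) = (common source, interference 1, interference 2).
latent = [(0, 0, 0), (0, 0, 3), (0, 3, 0), (0, 3, 3), (10, 0, 0), (10, 3, 3)]
modality_1 = [(c, a) for c, a, b in latent]  # sees the common source and a
modality_2 = [(c, b) for c, a, b in latent]  # sees the common source and b

K1 = gaussian_kernel(modality_1, eps=1.0)
K2 = gaussian_kernel(modality_2, eps=1.0)
K = kernel_product(K1, K2)

# Samples 0 and 3 share the common value but differ in both interferences.
print("single-modality affinity:", K1[0][3], K2[0][3])
print("fused affinity, same common value:", K[0][3])
print("fused affinity, different common value:", K[0][4])
```

In this toy setting, two samples that share the common value but differ in both interferences are weakly connected in each single-modality kernel, yet strongly connected in the product, while samples with different common values remain essentially disconnected. This is the sense in which the product emphasizes the common source under the stated assumptions.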