M.Sc Thesis

M.Sc Student: Kidron Einat
Subject: Audio-Visual Cross-Modal Analysis
Department: Department of Electrical and Computers Engineering
Supervisors: Prof. Yoav Schechner, Prof. Michael Elad


Cross-modal analysis is a natural progression beyond the processing of single-source signals. Simultaneous processing of two sources can reveal information that is unavailable when the sources are handled separately. Indeed, the study of human and animal perception, computer vision, weather forecasting, and various other scientific and technological fields can benefit from such a paradigm. A particular cross-modal problem is localization: out of the entire data array originating from one source, localize the components that best correlate with the other source. For example, auditory and visual data sampled from a scene can be used to localize the visual events associated with the soundtrack. In this thesis we present a rigorous analysis of fundamental problems associated with the localization task. We then develop an approach that leads efficiently to a unique, high-definition localization outcome. Our method is based on canonical correlation analysis (CCA), where the inherent ill-posedness is removed by exploiting the sparsity of cross-modal events. We apply our approach to the localization of audio-visual events. The proposed algorithm captures such dynamic audio-visual events with high spatial resolution, unlike prior attempts. The algorithm effectively detects the pixels that are associated with sound, while filtering out other dynamic pixels, overcoming substantial visual distractions and audio noise. The algorithm is simple and efficient thanks to its reliance on linear programming, and it is free of user-defined parameters. The algorithm's results are also measured using a quantitative criterion that we propose. In addition, we address a fundamental phenomenon, which we term chorus ambiguity, in which several audio-visual events occur simultaneously.
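
The role of sparsity and linear programming mentioned above can be illustrated with a minimal sketch. It is not the thesis's exact formulation. It assumes a hypothetical matrix `V` whose rows are video frames and whose columns are pixel features, and a hypothetical audio feature vector `a`, both synthetic here. When there are more pixels than frames, many weight vectors `w` satisfy `V @ w = a` exactly, so the correlation problem is ill-posed. Choosing the solution with the smallest l1 norm is one common way to favor a sparse answer, and that choice can be written as a linear program. The helper name `sparsest_weights` is an illustrative assumption, and SciPy's `linprog` is used as the solver.

```python
import numpy as np
from scipy.optimize import linprog


def sparsest_weights(V, a):
    """Minimize ||w||_1 subject to V @ w = a, posed as a linear program."""
    n_frames, n_pixels = V.shape
    # Split w = p - q with p, q >= 0, so that ||w||_1 = sum(p) + sum(q).
    cost = np.ones(2 * n_pixels)
    A_eq = np.hstack([V, -V])
    res = linprog(cost, A_eq=A_eq, b_eq=a, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n_pixels] - res.x[n_pixels:]


# Synthetic data (an assumption for illustration): 20 frames, 60 pixels,
# and only three pixels truly related to the audio.
rng = np.random.default_rng(0)
n_frames, n_pixels = 20, 60
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[5, 17, 42]] = [1.0, -2.0, 1.5]
a = V @ w_true

w_l1 = sparsest_weights(V, a)                # l1-minimal solution via LP
w_l2 = np.linalg.lstsq(V, a, rcond=None)[0]  # minimum l2-norm solution

print("nonzero weights (l1):", int(np.sum(np.abs(w_l1) > 1e-6)))
print("nonzero weights (l2):", int(np.sum(np.abs(w_l2) > 1e-6)))
```

Both `w_l1` and `w_l2` reproduce `a` from `V`, which is the ill-posedness in miniature. The l1 solution typically concentrates its weight on a few pixels, while the minimum l2-norm solution spreads weight over nearly all of them.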