Technion - Israel Institute of Technology
Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: David Dov
Subject: Audio Visual Speech Processing Using Diffusion Maps and the Scattering Transform
Department: Department of Electrical Engineering
Supervisor: Full Professor Cohen Israel
Full Thesis Text: English Version


Abstract

Voice activity detection in the presence of highly non-stationary noise and transient interferences is an open problem. State-of-the-art voice activity detectors based on statistical models usually assume that the noise statistics vary slowly relative to those of speech. This assumption does not hold for transient interferences, which are short-time interruptions, and the performance of these detectors deteriorates significantly in their presence. One solution is to incorporate a video signal, which is invariant to the acoustic environment. Although several voice activity detectors based on the video signal were recently presented in the literature, only a few detectors based on both the audio and the video signals exist to date.

In this thesis, we present two different approaches for voice activity detection. Both approaches incorporate supervised learning procedures and rely on a labeled training data set. In the first approach, we exploit the video signal and present an algorithm for audio-visual voice activity detection. The algorithm comprises a feature extraction procedure, where the features are designed to separate speech from non-speech frames. The diffusion maps method is applied separately, and in the same manner, to the features of each modality to embed them in a low-dimensional representation. Using the new representation, we propose a measure of voice activity that is based on a supervised learning procedure and on the variability between temporally adjacent frames. The measures of the two modalities are merged to provide voice activity detection based on both the audio and the video signals. Experimental results demonstrate that incorporating both audio and video signals is highly beneficial for voice activity detection. In addition, the proposed algorithm is shown to outperform state-of-the-art detectors.
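To make the embedding step concrete, the following is a minimal sketch of a textbook diffusion map applied to a matrix of per-frame features. It is not the thesis implementation: the function name `diffusion_map`, the Gaussian kernel, and the median heuristic for the kernel scale `epsilon` are assumptions chosen for illustration, and the thesis may use a different kernel, normalization, or scale selection.

```python
import numpy as np

def diffusion_map(X, n_components=2, epsilon=None):
    """Basic diffusion map of the rows of X (frames x features).

    Hypothetical helper for illustration; kernel and scale are assumptions.
    Returns the embedding and the sorted eigenvalues of the Markov matrix.
    """
    # Pairwise squared Euclidean distances between frames.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    if epsilon is None:
        epsilon = np.median(d2)  # heuristic kernel scale (assumption)
    K = np.exp(-d2 / epsilon)  # Gaussian affinity kernel

    # P = D^{-1} K is row-stochastic. The symmetric matrix
    # S = D^{-1/2} K D^{-1/2} has the same eigenvalues and is
    # numerically easier to decompose.
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]  # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of P are recovered as D^{-1/2} v.
    psi = vecs / np.sqrt(d)[:, None]

    # Drop the trivial constant eigenvector (eigenvalue 1) and scale
    # the remaining coordinates by their eigenvalues.
    idx = slice(1, n_components + 1)
    return psi[:, idx] * vals[idx], vals
```

In the audio-visual setting described above, such a function would be called once on the audio features and once on the video features, yielding one low-dimensional representation per modality.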
In the second approach, we focus on the audio signal and propose an algorithm for voice activity detection that is designed to operate in the presence of transients. We propose a continuous measure of voice activity based on the SVM classifier. The measure is constructed in a feature domain, where the features are based on the scattering transform, include noise estimation, and are designed to separate speech from non-speech frames. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art detectors for different types of background noise, and in particular that it accurately classifies frames containing transient interferences.
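One plausible way to obtain a continuous measure from an SVM is to use the signed decision value of each frame instead of its hard label. The sketch below illustrates that idea under stated assumptions: it trains a linear SVM by full-batch subgradient descent on the regularized hinge loss, and the names `train_linear_svm` and `voice_activity_measure`, the hyperparameters, and the linear kernel are all hypothetical. The thesis may construct its measure differently, and the feature matrix here stands in for the scattering-transform features.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear SVM by subgradient descent on the hinge loss.

    X: frames x features; y: labels in {-1, +1} (+1 = speech).
    Illustrative stand-in for a library SVM solver.
    """
    n, dim = X.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # frames that violate the margin
        ya = y[active][:, None]
        # Subgradient of lam/2 * ||w||^2 + mean hinge loss.
        grad_w = lam * w - (ya * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def voice_activity_measure(X, w, b):
    """Continuous score per frame: signed decision value w.x + b."""
    return X @ w + b
```

Thresholding the returned score at zero reproduces the hard SVM decision, while other thresholds trade off detection rate against false alarms.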