M.Sc Thesis

M.Sc Student: Rapaport Maya
Subject: Few-Shot Learning Neural Network for Audio-Visual Speech
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Israel Cohen


Audio-visual speech enhancement is the task of producing, from a noisy audio signal and the corresponding video frames of the speaker, an enhanced audio signal that contains only the target speaker’s voice, with the other speakers and the background noise suppressed. Unlike humans, audio-visual speech enhancement models find it very difficult to isolate and enhance a target speaker without prior familiarity with the speakers in the video. This problem is known as speaker dependency. It prevents speech enhancement models from being used in real-time applications, where prior familiarity cannot be guaranteed.

In this thesis we address the realistic problem of having only a small number of training samples of the target speaker. We propose a fast adaptation speech enhancement (FASE) model, a state-of-the-art audio-visual speech enhancement neural network. Our model is inspired by methods originally developed for few-shot learning in image classification, specifically the meta-learning approach. The model comprises a deep encoder-decoder architecture, which is trained by an outer network model for fast adaptation to new speakers. We also propose an improved encoder-decoder architecture.
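The general idea of meta-learning for fast adaptation can be illustrated with a minimal sketch. It is not the FASE model: the abstract does not state which meta-learning algorithm is used, so the sketch assumes a Reptile-style first-order update, and it replaces the encoder-decoder network with a single scalar weight and each speaker with a toy linear task. All names (`make_task`, `adapt`, `meta_train`) and all hyperparameters are hypothetical.

```python
import random

def make_task(rng):
    """A toy stand-in for a 'speaker': y = a * x with a task-specific slope a."""
    a = rng.uniform(1.0, 3.0)
    def sample(n):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        return [(x, a * x) for x in xs]
    return sample

def mse(w, data):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def adapt(w, data, lr=0.1, steps=5):
    """Inner loop: a few gradient steps on the few available samples."""
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def meta_train(rng, iterations=2000, shots=5, meta_lr=0.1):
    """Outer loop (assumed Reptile-style): move the shared initialisation
    a small step toward the weight obtained after adapting to each task."""
    w_meta = 0.0
    for _ in range(iterations):
        support = make_task(rng)(shots)
        w_task = adapt(w_meta, support)
        w_meta += meta_lr * (w_task - w_meta)
    return w_meta

rng = random.Random(0)
w_meta = meta_train(rng)

# Adapt to an unseen task from only five samples, then evaluate on held-out data.
new_task = make_task(rng)
support, query = new_task(5), new_task(50)
loss_before = mse(w_meta, query)
loss_after = mse(adapt(w_meta, support), query)
print(f"query loss before adaptation: {loss_before:.4f}, after: {loss_after:.4f}")
```

The outer loop learns an initialisation from which a handful of gradient steps on a few samples already reduces the error on a new task, which is the property a fast-adaptation model relies on when it meets a new speaker.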

We show that when only a few samples of the target speaker are available, the FASE model outperforms previous audio-visual speech enhancement models in both speech quality and intelligibility for all speakers. The model also demonstrates an improvement in computational performance, which suggests its potential for mobile systems such as smart hearing aids. When more samples of the target speaker are available, our model exhibits limitations, as do other few-shot learning methods.

Few-shot learning algorithms are designed specifically for cases with very few training samples. This is a disadvantage, since we would like performance to improve when the network is given more training data. Therefore, we investigate the limitations of few-shot learning algorithms when coping with more shots in the task of image classification. We observe that performance is unstable because the results depend on the number of training samples. We propose an algorithm to overcome this disadvantage, using the feature vectors extracted from the images. Our algorithm reduces this undesirable dependency and achieves better stability. Since the proposed algorithm is general and not specifically designed for images, it has great potential to fit other few-shot learning tasks. We believe that it can also be used for an improved independent audio-visual speech enhancement model.