Ph.D Thesis

Ph.D Student: Ben Esti Hadas
Subject: Statistical Methods for Speech Processing in Low Resource
Department: Department of Electrical and Computer Engineering
Supervisors: Professor Emeritus David Malah


Many speech processing systems rely on statistical models of speech signals and therefore require relatively large data sets for training. As technology advances, computational effort and memory footprint are becoming less of a constraint for such systems, but the amount of data available for training remains a challenge in many limited-data applications, such as under-documented languages, children's speech, and mobile applications in which most users are unwilling to invest much time and effort in recording themselves. For this setup we address two major speech processing tasks: voice conversion, in which a sentence spoken by a source speaker is converted to sound as if it were spoken by a target speaker, and keyword spotting (KWS), in which the goal is to detect whether a given keyword was spoken in a speech utterance.

Common voice conversion systems are based on a Gaussian Mixture Model (GMM) and therefore require at least several dozen recorded sentences for training. The trained conversion function is linear and often produces muffled synthesized signals due to over-smoothing of the converted spectral envelope. We present a voice conversion method for low data-resource applications, in which the conversion process is expressed as a sequential estimation problem: tracking the target spectrum based on the observed source spectrum.

To improve the quality of the converted synthesized signals, we also present methods for enhancing the global variance of the converted signal.
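As a rough illustration of what enhancing global variance can mean, the sketch below rescales each feature dimension of a converted utterance around its mean so that its utterance-level variance matches a target value. This is a generic post-filter written for illustration only; it is not claimed to be the method developed in the thesis. The function name `enhance_global_variance` and the `target_var` argument (for example, the per-dimension variance measured on the target speaker's training data) are assumptions.

```python
import math


def enhance_global_variance(frames, target_var):
    """Rescale each feature dimension so that its variance over the
    utterance matches target_var, while keeping the per-dimension mean.

    frames:     list of feature vectors (one list of floats per frame)
    target_var: desired variance per dimension (hypothetical input,
                e.g. measured on the target speaker's training data)
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dims)
    ]
    # A dimension with zero variance cannot be rescaled; leave it untouched.
    scales = [
        math.sqrt(target_var[d] / variances[d]) if variances[d] > 0 else 1.0
        for d in range(dims)
    ]
    return [
        [means[d] + scales[d] * (f[d] - means[d]) for d in range(dims)]
        for f in frames
    ]
```

An over-smoothed trajectory has a smaller variance than natural speech, so the scale factors are typically greater than one and the spectral detail is stretched away from the mean.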

Most voice conversion systems require a parallel training set, in which the two speakers say the same text. In this work we also address the non-parallel setup, where no assumptions are made regarding the uttered text of the training set. In this setup, in addition to training a conversion function, the source-target correspondence also needs to be evaluated. We present a generalized version of an existing method that uses temporal context vectors to improve the source-target matching process, and we prove that it converges.

Standard KWS methods require medium-to-large phonetically segmented training sets and are therefore not adequate for limited-data environments.

In this work we propose a new KWS method, suitable for this setup, based on discriminative classifiers for words and sentences. We present a new histogram representation for words, obtained with respect to a pre-trained Gaussian Mixture Model (GMM). Sentences are represented by a fixed-length global feature vector, extracted from the response curve obtained by the word classifier. A data set for training the GMM can be obtained easily, since no annotation or labeling is required.
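One plausible reading of a histogram representation with respect to a pre-trained GMM is a soft-assignment histogram: every frame of a word segment is softly assigned to the mixture components, and the assignments are averaged into a fixed-length vector. The sketch below assumes diagonal covariances and this soft-assignment scheme; the abstract does not specify either, and the names `gmm_histogram` and `log_gauss_diag` are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_histogram(frames, weights, means, variances):
    """Average the per-frame component posteriors of a pre-trained GMM.

    Returns a vector with one entry per mixture component that sums to 1,
    regardless of how many frames the word segment contains.
    """
    k = len(weights)
    hist = [0.0] * k
    for x in frames:
        log_p = [
            math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
            for j in range(k)
        ]
        top = max(log_p)  # subtract the maximum for numerical stability
        post = [math.exp(lp - top) for lp in log_p]
        total = sum(post)
        for j in range(k):
            hist[j] += post[j] / total
    return [h / len(frames) for h in hist]
```

Because the output length depends only on the number of mixture components, words of different durations map to vectors of the same size, which is what a discriminative classifier needs as input.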

Non-keyword recordings are easy to obtain, whereas speech containing the keyword must be provided specifically for each keyword, so a highly biased training set is a realistic scenario. To avoid biased classifiers, we use bagging predictors to train both the word and sentence classifiers.
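The idea of bagging on a biased training set can be sketched as follows: each bag draws a bootstrap sample of the few keyword examples and an equally sized bootstrap sample of the many non-keyword examples, a base classifier is trained on each balanced bag, and the final decision averages the votes. The nearest-centroid base learner below is only a stand-in chosen to keep the example self-contained; the abstract does not say which discriminative classifier or which sampling scheme the thesis uses, and all function names are hypothetical.

```python
import random


def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_bagged(positives, negatives, n_bags=25, seed=0):
    """Train one nearest-centroid model per balanced bootstrap bag."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        # Sample with replacement; both classes get len(positives) examples,
        # so no single bag is dominated by the non-keyword class.
        pos = [rng.choice(positives) for _ in positives]
        neg = [rng.choice(negatives) for _ in positives]
        models.append((centroid(pos), centroid(neg)))
    return models


def predict(models, x):
    """Fraction of bagged models that vote 'keyword' for feature vector x."""
    votes = sum(
        1 for pos_c, neg_c in models if sq_dist(x, pos_c) < sq_dist(x, neg_c)
    )
    return votes / len(models)
```

A score above 0.5 from `predict` would be read as a keyword detection; the threshold can be moved to trade false alarms against misses.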

According to our experiments, the proposed KWS system performs better than an HMM benchmark system for small training sets, and is more robust to highly variable signals, such as children's speech, and to noisy conditions, specifically babble and car noise over a wide range of SNR values.