Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Zeev Litichever
Subject: Classification of Transition Sounds with Application to Speech Recognition
Department: Department of Electrical Engineering
Supervisors: Dr. Dan Chazan, Full Professor Merhav Neri


Abstract

Classification of speech sounds is one of the building blocks of large-vocabulary continuous speech recognition systems. Current systems are adapted primarily to the recognition of continuous speech sounds. This adaptation shows in the choice of hidden Markov models (HMMs) as the classification scheme and in the choice of features, frame shift and frame duration, all of which are better suited to slowly varying sounds.

In current systems, the use of single-stage classification prevents the model and the features from being adapted to the classification task. In alternative methods, the sound is classified hierarchically in several stages. To avoid complex models, these methods use non-parametric classifiers in the last stage. Unlike early HMM methods, they use discriminative training, in which only the decision boundaries between speech sounds are learned. Recent years have seen the development of promising new classifiers, such as support vector machines (SVMs) and boosting, which are suitable for non-parametric classification.

This research is aimed at the design of classifiers for the recognition of transition sounds at the last stage of a hierarchical classification scheme. The work examines several discriminative classifiers, which model each sound as a single event without context.

A new classification scheme is proposed. The proposed scheme contains a discriminative classifier, which produces a local score for some of the feature vectors produced by the front-end. Each element of the local score is the output of an SVM that is trained to discriminate the signal of one sound from the signals of the other sounds in the category. Each SVM is trained on, and applied to, the feature vectors of a fixed sequence of frames relative to the marked phone start. The best sequence for computing each element of the local score is determined during training by exhaustive search. A phoneme score is obtained by summing the local scores over certain ranges of frames. The proposed scheme was tested on the stop consonants from the TIMIT corpus and has shown good performance.
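The score-combination step can be illustrated with a minimal sketch. It assumes that the per-sound SVM outputs have already been computed for each frame offset relative to the marked phone start. The data layout, the function names `phoneme_scores` and `classify`, the per-sound frame ranges and the numeric values are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Assumed layout: local_scores[sound][t] is the output of that sound's
# one-vs-rest SVM at frame offset t relative to the marked phone start.
LocalScores = Dict[str, List[float]]
FrameRanges = Dict[str, Tuple[int, int]]


def phoneme_scores(local_scores: LocalScores,
                   ranges: FrameRanges) -> Dict[str, float]:
    """Sum each sound's local scores over its own inclusive frame range."""
    totals = {}
    for sound, scores in local_scores.items():
        start, end = ranges[sound]
        totals[sound] = sum(scores[start:end + 1])
    return totals


def classify(local_scores: LocalScores, ranges: FrameRanges) -> str:
    """Pick the sound with the largest summed score."""
    totals = phoneme_scores(local_scores, ranges)
    return max(totals, key=totals.get)


# Made-up local scores for three stop consonants over three frame offsets.
example_scores = {
    "p": [0.2, -0.1, 0.4],
    "t": [0.5, 0.6, -0.9],
    "k": [-0.3, 0.1, 0.0],
}
# Made-up summation ranges (inclusive offsets), one per sound.
example_ranges = {"p": (0, 2), "t": (0, 1), "k": (1, 2)}

decision = classify(example_scores, example_ranges)
```

With these made-up numbers the summed scores are roughly 0.5 for "p", 1.1 for "t" and 0.1 for "k", so `decision` is "t". Whether the thesis uses a separate range per sound or a shared one is not stated in the abstract; the per-sound ranges here are an illustrative choice.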

The work compares several schemes for using a static learner. Several schemes were considered for choosing the subset of frames (relative to the marked sound start) whose feature vectors are used in training and classification. It was shown that the number and the location of the frames whose feature vectors are used for training and testing the SVM can be selected to reduce the probability of error.
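One way to picture the frame-selection step is an exhaustive search over candidate frame subsets, keeping the subset with the lowest held-out error. The sketch below is only an illustration under stated assumptions: the learner is hidden behind a hypothetical callable `error_of`, which would train a classifier on the feature vectors of the given frame offsets and return its error on held-out data. The toy error function in the usage lines is invented and does not reflect any result from the thesis.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

FrameSubset = Tuple[int, ...]


def best_frame_subset(candidate_offsets: Sequence[int],
                      max_frames: int,
                      error_of: Callable[[FrameSubset], float]
                      ) -> Tuple[FrameSubset, float]:
    """Try every subset of up to max_frames offsets; return the best one.

    error_of is a hypothetical stand-in for "train on these frames and
    measure the error on held-out data".
    """
    best_subset: FrameSubset = ()
    best_error = float("inf")
    for size in range(1, max_frames + 1):
        for subset in combinations(candidate_offsets, size):
            err = error_of(subset)
            if err < best_error:  # strict: the first subset found wins ties
                best_subset, best_error = subset, err
    return best_subset, best_error


def toy_error(subset: FrameSubset) -> float:
    """Invented error surface, used only to exercise the search."""
    return abs(sum(subset) - 3) + 0.1 * len(subset)


# Offsets are frame positions relative to the marked sound start.
chosen, chosen_error = best_frame_subset([-1, 0, 1, 2, 3], 2, toy_error)
```

The number of subsets grows combinatorially with `max_frames`, so a search of this kind is practical only for a small set of candidate offsets.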

The work examines several discriminative learners for producing binary threshold functions. In particular, SVMs, boosting and regularized neural networks were studied. The probabilities of error of the resulting binary threshold functions were compared experimentally on data from the TIMIT corpus. SVMs were found to be the most effective classifiers and were easy to tune. For large training sets, the performance of regularized neural networks was similar to that of the SVMs.