M.Sc Thesis

M.Sc Student: Genussov Michal
Subject: Transcription and Classification of Audio Data by Sparse Representations and Geometric Methods
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Israel Cohen


Transcription of music and classification of audio and speech data are two important tasks in audio signal processing. Transcription of polyphonic music involves identifying the fundamental frequencies (pitches) of several notes played simultaneously. The difficulty of this task stems from the fact that harmonics of different tones tend to overlap, especially in Western music. This overlap makes it hard to assign the harmonics to their true fundamental frequencies and to deduce the spectra of several sounds from their sum.

Classification of audio and speech data includes classification of music by genre and identification of speech phonemes. Traditional classification methods consist of two main stages: feature extraction, in which relevant features (usually temporal and spectral) are extracted from the signal, and classification according to these features. These methods have two drawbacks: they are usually not well adapted to the non-linear structure of the feature vectors, and they do not account for the redundancy of the features. This leads to unsatisfactory classification results and to high computational complexity.

In this thesis, we introduce transcription and classification methods that are based on representing the data in a meaningful manner. For transcription of polyphonic music, we present an algorithm based on sparse representations in a structured dictionary that is suited to the spectra of music signals. Thanks to the structured dictionary, the algorithm does not require a large or diverse data set, and it is computationally more efficient than alternative methods.
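The idea of a sparse representation over a harmonically structured dictionary can be illustrated with a minimal sketch. It is not the algorithm of the thesis: the atom shape (Gaussian peaks at integer multiples of the pitch, with 1/h amplitude decay), the pitch range, and the use of plain orthogonal matching pursuit are all assumptions made for illustration, and the names `harmonic_dictionary` and `omp_transcribe` are hypothetical.

```python
import numpy as np

def harmonic_dictionary(midi_notes=None, n_harmonics=8, sigma_hz=5.0, freqs=None):
    """Build one spectral atom per pitch (illustrative atom model, an assumption)."""
    if midi_notes is None:
        midi_notes = np.arange(57, 73)          # A3 .. C5, arbitrary range
    if freqs is None:
        freqs = np.arange(0.0, 4500.0, 2.0)     # frequency grid in Hz
    f0 = 440.0 * 2.0 ** ((midi_notes - 69) / 12.0)
    D = np.zeros((freqs.size, f0.size))
    for j, f in enumerate(f0):
        for h in range(1, n_harmonics + 1):
            # Gaussian peak at the h-th harmonic, amplitude decaying as 1/h
            D[:, j] += np.exp(-0.5 * ((freqs - h * f) / sigma_hz) ** 2) / h
    D /= np.linalg.norm(D, axis=0)              # unit-norm atoms
    return D, midi_notes

def omp_transcribe(spectrum, D, midi_notes, n_notes):
    """Greedy sparse coding: pick n_notes atoms by orthogonal matching pursuit."""
    residual = spectrum.copy()
    support = []
    coefs = np.zeros(0)
    for _ in range(n_notes):
        scores = D.T @ residual                 # correlation of each atom with the residual
        if support:
            scores[support] = -np.inf           # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Re-fit all selected atoms jointly, then update the residual
        coefs, *_ = np.linalg.lstsq(D[:, support], spectrum, rcond=None)
        residual = spectrum - D[:, support] @ coefs
    return [int(midi_notes[j]) for j in support], coefs
```

In this sketch, a synthetic spectrum built from two atoms whose harmonics overlap (for example MIDI notes 57 and 64, a perfect fifth) can be passed to `omp_transcribe` with `n_notes=2` to see how the shared partials are divided between the two pitches.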

For classification of audio data, we propose integrating a non-linear manifold learning technique, namely "diffusion maps", into traditional classification methods. In this technique, a graph is built from the feature vectors, and the distances in the graph are mapped to Euclidean distances.
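The mapping described above can be sketched in a few lines. This is a generic textbook-style diffusion map under stated assumptions, not the exact configuration used in the thesis: the Gaussian kernel, its scale `epsilon`, the number of retained coordinates, and the function name `diffusion_map` are all illustrative choices.

```python
import numpy as np

def diffusion_map(X, epsilon, n_components=2, t=1):
    """Embed the rows of X (feature vectors) with a basic diffusion map."""
    # Graph weights: Gaussian kernel on pairwise squared Euclidean distances
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / epsilon)
    deg = K.sum(axis=1)
    # Symmetric matrix with the same eigenvalues as the random-walk matrix P = D^-1 K
    A = K / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]              # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(deg)[:, None]          # right eigenvectors of P
    # Drop the trivial constant eigenvector; scale coordinates by eigenvalue^t
    emb = psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t
    return emb, vals
```

Euclidean distances between rows of the returned embedding approximate diffusion distances on the graph, so a standard classifier can then be applied to the embedded coordinates in place of the raw feature vectors.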

Finally, we examine the performance of the proposed solutions empirically. We show that our structured-dictionary-based transcription system outperforms existing methods in several transcription tasks, especially in the difficult case of a small data set with multiple overlapping harmonics. In classification of musical pieces by genre and in identification of unvoiced fricative phonemes using diffusion maps, most of the samples are classified correctly. However, compared to classification using the features without the mapping, we find that the classification results are not improved. This implies that the assumption of non-linear redundancy between the features depends on several factors, including the application and the choice of features before mapping.