M.Sc Thesis

M.Sc Student: Li Mingzi
Subject: Multisensory Speech Enhancement in Noisy Environments Using Bone-Conducted and Air-Conducted Microphones
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Israel Cohen
Full Thesis Text: English Version


In this thesis, we address the speech enhancement problem by exploiting bone-conducted and air-conducted microphones. A bone-conducted microphone, being less sensitive to surrounding noise, may be used to complement a regular air-conducted microphone in noisy environments. However, because the high-frequency components of bone-conducted microphone signals are significantly attenuated by transmission loss, the quality of speech acquired by a bone-conducted microphone is relatively low. We therefore aim to enhance the speech signal by combining both microphones, producing a high-quality speech signal with low background noise.

Existing multisensory speech enhancement methods may be classified into two main categories, according to the role of the bone-conducted microphone. In the first category, the bone-conducted microphone is used as a supplementary sensor; in the second, it is used as the dominant sensor. Methods in the first category rely on the accuracy of voice activity detection or pitch extraction facilitated by the bone-conducted microphone. When the bone-conducted microphone is used as the main acquisition sensor, the algorithms are based on equalization, analysis-and-synthesis, or probabilistic approaches.

In this thesis, clean speech is restored through a family of functions named geometric harmonics, i.e., eigenfunction extensions of a Gaussian kernel. Geometric harmonics can describe the geometry of high-dimensional data and extend both these descriptions and functions defined on the data to new data points. In our case, the high-dimensional data is formed by concatenating air-conducted and bone-conducted speech in the short-time Fourier transform (STFT) domain. A nonlinear mapping from a new concatenation of speech signals to the STFT of clean speech can then be obtained as a linear combination of geometric harmonics.
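
The extension described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the kernel form exp(-d^2 / eps), the parameters `eps` (extension scale) and `cond` (condition number bound), and the use of generic feature rows in place of concatenated STFT frames are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(A, B, eps):
    """Gaussian kernel exp(-||a - b||^2 / eps) between the rows of A and B."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / eps)


def fit_geometric_harmonics(X, F, eps, cond=1e6):
    """Eigendecompose the kernel on the training rows X and project F.

    X: (n, d) training features (assumed here: concatenated air- and
       bone-conducted spectral frames).
    F: (n, m) target values on the training rows (assumed: clean frames).
    Only eigenvalues larger than lam_max / cond are kept, since dividing
    by tiny eigenvalues makes the extension numerically unstable.
    """
    K = gaussian_kernel(X, X, eps)
    lam, psi = np.linalg.eigh(K)          # eigenvalues in ascending order
    lam, psi = lam[::-1], psi[:, ::-1]    # reorder to descending
    keep = lam > lam[0] / cond
    lam, psi = lam[keep], psi[:, keep]
    coeffs = psi.T @ F                    # expansion coefficients of F
    return lam, psi, coeffs


def extend_geometric_harmonics(X, X_new, eps, lam, psi, coeffs):
    """Evaluate the fitted mapping at new rows X_new (Nystrom-type extension)."""
    K_new = gaussian_kernel(X_new, X, eps)
    psi_ext = (K_new @ psi) / lam         # extended eigenfunctions
    return psi_ext @ coeffs               # linear combination of harmonics
```

On the training rows themselves, the extended eigenfunctions coincide with the eigenvectors, so the mapping reduces to the projection of `F` onto the retained eigenvectors. Both `eps` and `cond` have to be chosen by the user in this sketch.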

Applying geometric harmonics requires careful tuning of the extension scale and the condition number. To avoid this scale tuning, we instead use a multi-scale Laplacian pyramid extension. Built on a kernel regression scheme, the Laplacian pyramid extension approximates, at each level, the residual of the previous level's representation using Gaussian kernels of progressively finer scale.
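
The residual-fitting idea can be sketched as follows. This is an illustrative assumption rather than the thesis code: the row-normalized (Nadaraya-Watson) smoother, the halving of the scale at each level, and the names `eps0`, `n_levels`, and `tol` are choices made only for this example.

```python
import numpy as np


def gaussian_kernel(A, B, eps):
    """Gaussian kernel exp(-||a - b||^2 / eps) between the rows of A and B."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / eps)


def kernel_smooth(K, Y):
    """Kernel regression: each output row is a weighted average of rows of Y."""
    return (K / K.sum(axis=1, keepdims=True)) @ Y


def fit_laplacian_pyramid(X, F, eps0, n_levels=12, tol=1e-6):
    """Store, for each level, the scale and the residual it smooths.

    Level 0 smooths F itself with a coarse kernel; every later level
    smooths what the previous levels failed to capture, using half the
    previous scale (an assumed schedule).
    """
    levels = []
    residual = np.asarray(F, dtype=float).copy()
    f_norm = np.linalg.norm(residual)
    eps = eps0
    for _ in range(n_levels):
        K = gaussian_kernel(X, X, eps)
        levels.append((eps, residual.copy()))
        residual = residual - kernel_smooth(K, residual)
        if np.linalg.norm(residual) <= tol * f_norm:
            break
        eps /= 2.0
    return levels


def extend_laplacian_pyramid(X, X_new, levels):
    """Sum the smoothed residuals of all levels at the new rows X_new."""
    out = 0.0
    for eps, d in levels:
        out = out + kernel_smooth(gaussian_kernel(X_new, X, eps), d)
    return out
```

In this sketch only the coarsest scale `eps0` and a stopping rule need to be set, and the accuracy on the training rows improves as levels are added, which is why no single scale has to be tuned by hand.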

Experiments are conducted on simulated air-conducted and bone-conducted speech in interfering-speaker and Gaussian noise environments. The geometric methods provide a consistent reconstruction of speech spectrograms across a variety of noise levels and noise types. Log-spectral distance results obtained with the proposed methods are compared with those of an existing probabilistic approach. We show that the Laplacian pyramid method outperforms the other methods.
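
For reference, one common form of the log-spectral distance is sketched below. The exact definition used in the thesis is not given here, so the per-frame root-mean-square over frequency bins, the averaging over frames, the dB scaling, and the `floor` parameter are assumptions.

```python
import numpy as np


def log_spectral_distance(S_ref, S_est, floor=1e-10):
    """Mean over frames of the RMS difference of log power spectra, in dB.

    S_ref, S_est: STFT arrays of shape (freq_bins, frames), complex or
    magnitude. `floor` clips the power to avoid log of zero (an assumed
    safeguard, not taken from the thesis).
    """
    log_ref = 10.0 * np.log10(np.maximum(np.abs(S_ref) ** 2, floor))
    log_est = 10.0 * np.log10(np.maximum(np.abs(S_est) ** 2, floor))
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=0))
    return float(np.mean(per_frame))
```

Under this definition, identical spectrograms give a distance of zero, and a lower value indicates a reconstruction that is closer to the clean reference.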