M.Sc Thesis

M.Sc Student: Tiomkin Stanislav
Subject: A Segment-Wise Hybrid Approach for Improved Quality Text-to-Speech Synthesis
Department: Department of Electrical and Computer Engineering
Supervisor: Professor Emeritus David Malah


Concatenative Text-To-Speech (CTTS) synthesis and statistical TTS (STTS) synthesis are the two main approaches to text-to-speech synthesis. CTTS directly uses parameters of natural speech features, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, since segments with the required characteristics are not always available, segments with the closest available characteristics are used instead. Concatenation of such segments may therefore result in audible discontinuities. On the other hand, an STTS system, while having a smaller footprint than CTTS, generates speech that is free of such discontinuities, but it often sounds unnatural.

In this research we develop two approaches aimed at improving the quality of TTS-generated speech. First, we develop two techniques for improving the quality of a baseline STTS system. Second, we propose a technique for combining CTTS with STTS into a new class of TTS systems, denoted hybrid TTS (HTTS).

In STTS, speech feature dynamics are modeled by first- and second-order feature frame differences, which typically do not satisfactorily represent the frame-to-frame feature dynamics present in natural speech. The reduced dynamics results in over-smoothing of speech features, so that the synthesized speech often sounds muffled and buzzy. To enhance a baseline STTS system we propose two methods. First, we propose to represent speech feature dynamics in a transform domain rather than directly in terms of frame-to-frame variation. In the transform domain, the insufficient dynamics is characterized explicitly by a marked attenuation of inter-harmonic components. The quality of speech generated by an STTS system is improved by enhancing these attenuated components.
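The idea of enhancing attenuated transform-domain components can be illustrated with a minimal sketch. Here the transform is taken to be an orthonormal DCT of a single feature trajectory, and the `cutoff` and `gain` parameters marking and boosting the attenuated region are purely illustrative assumptions, not values from the thesis:

```python
import numpy as np

def enhance_trajectory(traj, cutoff=0.2, gain=1.5):
    """Sketch: boost attenuated high-modulation-frequency components of a
    statistically generated feature trajectory.  The choice of DCT and the
    cutoff/gain values are illustrative assumptions."""
    n = len(traj)
    # Orthonormal DCT-II matrix: rows index frequency, columns index time.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] /= np.sqrt(2)
    C *= np.sqrt(2 / n)
    coeffs = C @ np.asarray(traj, dtype=float)   # analysis transform
    hi = k >= int(cutoff * n)                    # attenuated region (assumed)
    coeffs[hi] *= gain                           # amplify those components
    return C.T @ coeffs                          # inverse (C is orthonormal)
```

With `gain=1.0` the function is an identity (analysis followed by synthesis), which makes the transform pair easy to sanity-check before any enhancement is applied.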

Second, we introduce a segment-wise model representation with a norm constraint. The segment-wise representation provides additional degrees of freedom in speech feature determination, which we exploit to increase the speech feature vector norm so that it matches a norm constraint. Thus, statistically generated speech features are not over-smoothed, resulting in more natural-sounding speech. The segment-wise representation method is superior to the transform domain method in terms of generated speech quality.
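The effect of the norm constraint can be conveyed by a toy sketch: an over-smoothed feature vector is rescaled so that its norm matches a target value. A single global gain is used here only for illustration; the thesis exploits segment-wise degrees of freedom rather than one scalar:

```python
import numpy as np

def match_norm(generated, target_norm):
    """Sketch: rescale a statistically generated feature vector so its norm
    meets a constraint, counteracting over-smoothing.  The global rescaling
    is an illustrative simplification of the segment-wise method."""
    cur = np.linalg.norm(generated)
    if cur == 0.0:
        return np.asarray(generated, dtype=float)
    return np.asarray(generated, dtype=float) * (target_norm / cur)
```

In practice the target norm would be estimated from natural speech statistics, so that the generated features recover the variability lost to statistical averaging.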

Finally, we propose to combine the advantages of CTTS and STTS into another type of TTS system, denoted HTTS. This is a hybrid system in which, for each utterance, natural segments and model-generated segments are interweaved via a hybrid dynamic path algorithm. Thus, speech generated by the proposed HTTS contains fewer discontinuities than that of the baseline CTTS system, and sounds more natural than that of the baseline STTS system.
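The interweaving of natural and model-generated segments via a dynamic path can be sketched as a small Viterbi-style dynamic program. The unit and join costs below are placeholders; the abstract does not specify the actual cost functions used in the thesis:

```python
import numpy as np

def hybrid_path(unit_costs, join_costs):
    """Sketch of a hybrid dynamic path: for each segment slot choose a
    natural (CTTS, index 0) or model-generated (STTS, index 1) unit so that
    total unit cost plus join (concatenation) cost is minimal.

    unit_costs: (T, 2) array, cost of the natural / model unit per slot.
    join_costs: (2, 2) array, join_costs[a, b] = cost of b following a.
    Returns a list of 0/1 choices, one per slot."""
    unit_costs = np.asarray(unit_costs, dtype=float)
    T = len(unit_costs)
    D = np.zeros((T, 2))            # best accumulated cost ending in each type
    back = np.zeros((T, 2), dtype=int)
    D[0] = unit_costs[0]
    for t in range(1, T):
        for b in (0, 1):
            cands = D[t - 1] + join_costs[:, b]
            back[t, b] = int(np.argmin(cands))
            D[t, b] = cands[back[t, b]] + unit_costs[t, b]
    # Backtrack the optimal choice sequence.
    path = [int(np.argmin(D[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With zero join costs the path simply picks the cheaper unit in each slot; nonzero join costs penalize switching between natural and model-generated segments, which is where the smoothness/naturalness trade-off enters.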