Technion - Israel Institute of Technology, School of Graduate Studies
M.Sc Thesis
M.Sc Student: Shoham Tamar
Subject: Quality-Preserving Footprint-Reduction of Concatenative Text-To-Speech Synthesizers
Department: Department of Electrical Engineering
Supervisor: Professor Emeritus David Malah
Full Thesis Text: English Version


Abstract

High-quality, low-footprint Concatenative Text-To-Speech (CTTS) synthesizers pose a persistent challenge in the field of speech processing. The spectral parameters representing the short speech segments used in the concatenation process constitute a large portion of the required memory. In this work we propose algorithms for (re)compression of this previously compressed data, which is stored in 3D acoustic leaves. We require that the algorithms be generic, so that they may also be applied to (re)compression of other data stored in 3D units or structures that exhibit some redundancy, primarily due to temporal evolution, but possibly along all three axes. Since the requirement for a small footprint, i.e., low memory consumption, often corresponds to devices with limited resources, we also require that the algorithms have low decoding complexity.


We propose two (re)compression approaches. The first is an algorithm from the family of Temporal Decomposition, which aims to minimize temporal redundancy between consecutive frames. We use Polynomial TD and perform data segmentation and polynomial order selection adaptively using a generalized trellis scheme. The second approach is based on 3D-Shape Adaptive DCT that removes redundancies in all three dimensions. This approach requires design of quantizers for 3D data, which we address by developing a methodical bit-allocation and splitting algorithm. We also propose a segment reordering algorithm, which may be applied before the (re)compression, to order the speech segments in a manner that will maximize overall performance. The proposed algorithms were evaluated on an IBM small footprint CTTS system, and enabled reduction of the stored amplitude spectral parameters (the main contributor to the footprint) by a factor of 2, without compromising the perceptual quality of the obtained speech. While we tested the proposed algorithms on a specific setup, they are expected to apply to a variety of 3D (re)compression challenges.