Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Volfin Ilana
Subject: Dominant Speaker Identification for Multipoint Videoconferencing
Department: Department of Electrical Engineering
Supervisor: Professor Israel Cohen


Abstract

A multipoint conference consists of N participants received through N distinct channels. Dominant speaker identification is the task of determining which participant is the most dominant at a given time. This identification helps reduce the computational load on the communication network on which the conference is conducted. In addition, it enables the conferees to focus their attention on the most active participant. The techniques applied to this problem so far rely mostly on instantaneous measures of speech activity, with the underlying assumption that a switch in speaker is characterized by a rise in this activity.

These methods neglect the long-term properties of dominant speech, which makes them prone to false speaker switching. They also fail to distinguish between speech and non-speech transient audio occurrences.


In this thesis, we propose a dominant speaker identification method that reduces the number of false speaker switches and is also robust against transient audio occurrences.

The proposed method is based on speech activity evaluation over time intervals of three different lengths. The speech activity for immediate, medium and long time intervals is evaluated and represented by three respective speech activity scores. In the process of score evaluation, we propose two models for the likelihood of the detected audio activity: one under the assumption that speech is present in the time interval, and the other under the assumption that speech is absent. The unique set of scores acquired for each channel facilitates the identification of the dominant speaker. The identification is based on a comparison of scores across all channels.
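The idea of scoring each channel on three time scales and comparing scores across channels can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the per-frame binary activity flags, the binomial likelihood models with probabilities `p_speech` and `p_silence`, the window lengths, the `margin`, and the switching rule that requires a challenger to win on all three scales.

```python
import math


def log_likelihood_ratio(active, total, p_speech=0.8, p_silence=0.1):
    """Log-ratio of two hypothetical binomial likelihoods.

    Compares "speech present" (a frame is active with probability p_speech)
    against "speech absent" (probability p_silence). The binomial
    coefficient is the same in both models and cancels in the ratio.
    """
    inactive = total - active
    ll_speech = active * math.log(p_speech) + inactive * math.log(1 - p_speech)
    ll_silence = active * math.log(p_silence) + inactive * math.log(1 - p_silence)
    return ll_speech - ll_silence


def activity_scores(frames, windows=(5, 20, 80)):
    """Return (immediate, medium, long) scores for one channel.

    frames: list of 0/1 activity flags, most recent frame last.
    windows: assumed interval lengths, in frames.
    """
    scores = []
    for w in windows:
        recent = frames[-w:]
        scores.append(log_likelihood_ratio(sum(recent), len(recent)))
    return tuple(scores)


def pick_dominant(channels, current, margin=1.0):
    """Compare scores across channels and return the dominant channel name.

    A challenger replaces the current speaker only if it exceeds the
    current speaker's score by `margin` on all three time scales.
    """
    scores = {name: activity_scores(f) for name, f in channels.items()}
    best = current
    for name, s in scores.items():
        if name == best:
            continue
        if all(c > b + margin for c, b in zip(s, scores[best])):
            best = name
    return best


# Channel A pauses briefly after long speech; channel B emits a short burst.
channels = {"A": [1] * 75 + [0] * 5, "B": [0] * 75 + [1] * 5}
print(pick_dominant(channels, current="A"))  # prints "A"
```

In this toy case the short burst on channel B wins only on the immediate interval, so no switch occurs. This shows why longer intervals can suppress switches caused by transient sounds.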


Objective evaluation of the proposed method is performed on a synthetic conference with and without the presence of transient audio occurrences. The performance is evaluated in terms of the time segments in which a false dominant speaker was identified. A segment with a falsely detected dominant speaker usually follows a false speaker switch. In this evaluation framework, the errors reflect the total number of false switches, the location of each falsely detected segment within the dominant speech burst, and its duration.
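A segment-based error measure of this kind could be computed roughly as follows. This is a sketch under stated assumptions, not the evaluation code of the thesis: it assumes frame-aligned label sequences for the true and the detected dominant speaker, and the helper names `false_segments` and `offset_in_burst` are hypothetical.

```python
from itertools import groupby


def false_segments(truth, detected):
    """Return (start, length) for each maximal run of frames
    in which the detected speaker differs from the true one."""
    segments = []
    index = 0
    mismatches = (t != d for t, d in zip(truth, detected))
    for is_error, run in groupby(mismatches):
        length = sum(1 for _ in run)
        if is_error:
            segments.append((index, length))
        index += length
    return segments


def offset_in_burst(truth, start):
    """Number of frames between the start of the true speech burst
    containing frame `start` and `start` itself."""
    i = start
    while i > 0 and truth[i - 1] == truth[start]:
        i -= 1
    return start - i


truth = list("AAAAAABBBB")
detected = list("AABBAABBBB")
segments = false_segments(truth, detected)
print(segments)                               # prints [(2, 2)]
print(offset_in_burst(truth, segments[0][0])) # prints 2
```

The number of segments approximates the number of false switches. The start offset and length give the location and duration of each error within the dominant speech burst.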


A qualitative evaluation on a segment of a real multipoint conference is conducted as well. The proposed method is compared with existing methods of dominant speaker identification. We achieve a reduction in the number of false speaker switches and improved robustness against transient audio occurrences.