M.Sc Thesis

M.Sc Student: Nuriel Tamir
Subject: Region-of-Interest Based Adaptation of Video for Mobile
Department: Department of Electrical and Computer Engineering
Supervisor: Professor Emeritus David Malah


The goal of this work is to develop methods for adapting standard-definition video to the smaller resolution used in mobile devices. The naive solution is to scale each frame to the desired size, which reduces the size of every object. However, objects must be large enough to be recognized easily. Previous works have therefore suggested displaying only a Region Of Interest (ROI), which includes only the most important content.

We focus on news broadcasts and interview scenes. The ROI is defined by a rectangle around the speaker's face. For every scene, the ROI is detected and then tracked until the scene ends. Only the ROI is adjusted to the desired resolution. An editor can mark several ROIs for tracking, for example by marking two speakers in an interview scene so that only the active speaker is transmitted in each frame.

We present a novel algorithm for tracking the ROI by estimating the global motion caused by the camera and the local motion caused by the speaker's movements. We use a pan, tilt and zoom camera model and present an algorithm for estimating its parameters in the frequency domain. We show that only slices of the 2D Fourier transform are needed. By the slice-projection theorem, these slices can be computed efficiently as the 1D Fourier transforms of the horizontal and vertical projections. We further reduce complexity by using the Mellin transform in the spatial domain. This transform converts the scale parameter into a multiplicative parameter. We compare the computational complexity of the different methods and give a performance analysis of the estimation. We further improve tracking by using Kalman or particle filtering.
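
The projection step can be illustrated with a minimal Python sketch. It is not the algorithm of the thesis: it assumes that pan and tilt appear as a pure integer circular translation between two frames, it ignores zoom, and it uses ordinary 1D phase correlation as a stand-in estimator. The function names projection_slices and shift_1d are hypothetical.

```python
import numpy as np

def projection_slices(frame):
    """Axis slices of the 2D DFT, computed from 1D projections only.

    By the slice-projection theorem these equal fft2(frame)[0, :] and
    fft2(frame)[:, 0], but cost two 1D FFTs instead of a full 2D FFT.
    """
    col_proj = frame.sum(axis=0)  # sum over rows: profile along the x axis
    row_proj = frame.sum(axis=1)  # sum over columns: profile along the y axis
    return np.fft.fft(col_proj), np.fft.fft(row_proj)

def shift_1d(ref_spec, cur_spec):
    """Integer circular shift of cur relative to ref (1D phase correlation).

    Assumption for this sketch: the two profiles differ only by a shift.
    """
    cross = cur_spec * np.conj(ref_spec)
    cross = cross / np.maximum(np.abs(cross), 1e-12)  # keep the phase only
    corr = np.fft.ifft(cross).real
    peak = int(np.argmax(corr))
    n = corr.size
    # Peaks past the midpoint correspond to negative shifts.
    return peak if peak <= n // 2 else peak - n

# Usage: a synthetic frame moved by 3 rows (tilt) and -5 columns (pan).
rng = np.random.default_rng(0)
frame = rng.random((48, 64))
moved = np.roll(frame, shift=(3, -5), axis=(0, 1))

ref_x, ref_y = projection_slices(frame)
cur_x, cur_y = projection_slices(moved)
dx = shift_1d(ref_x, cur_x)
dy = shift_1d(ref_y, cur_y)
```

Working on the two projections replaces one 2D transform per frame with two 1D transforms, which is the source of the complexity reduction mentioned above. Real frames are not circularly shifted, so a practical version would need windowing and a treatment of zoom.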

We present a tool for detecting the ROI from a single point indicated by an editor, using a skin-color detector combined with region growing.
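
A minimal Python sketch of this idea follows. It is an illustration under stated assumptions, not the detector of the thesis: the skin test uses fixed chrominance thresholds in YCbCr (an assumed, commonly quoted range), region growing is a plain 4-connected flood fill, and the names skin_mask and grow_roi are hypothetical.

```python
from collections import deque
import numpy as np

def skin_mask(rgb):
    """Boolean mask of skin-colored pixels in an (H, W, 3) RGB image.

    Assumption: fixed Cb/Cr thresholds; a real detector would be tuned
    or trained on representative data.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)

def grow_roi(mask, seed):
    """Grow a region from the editor's seed (row, col) over skin pixels.

    Returns the bounding rectangle (top, left, bottom, right), inclusive,
    or None if the seed itself is not a skin pixel.
    """
    h, w = mask.shape
    sy, sx = seed
    if not mask[sy, sx]:
        return None
    seen = np.zeros((h, w), dtype=bool)
    seen[sy, sx] = True
    queue = deque([(sy, sx)])
    top, bottom, left, right = sy, sy, sx, sx
    while queue:
        y, x = queue.popleft()
        top, bottom = min(top, y), max(bottom, y)
        left, right = min(left, x), max(right, x)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                seen[ny, nx] = True
                queue.append((ny, nx))
    return top, left, bottom, right

# Usage: a synthetic frame with two separate skin-colored patches.
image = np.zeros((40, 60, 3), dtype=np.uint8)
image[...] = (20, 40, 200)                 # bluish background
image[10:25, 20:35] = (220, 170, 140)      # "face" patch
image[30:35, 50:55] = (220, 170, 140)      # unrelated skin-colored patch
roi = grow_roi(skin_mask(image), seed=(15, 25))
```

Because growth starts at the indicated point, other skin-colored areas in the frame that are not connected to the seed do not affect the resulting rectangle.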

Another tool detects the ROI containing the currently active speaker. We tested methods based solely on visual information, as well as a method that also uses the soundtrack. In the latter, we track visual points on the speakers' faces, find the time instants of significant changes in their motion, and compare them to audio onset times. The correlation between the time instants of the events in video and audio is used to determine the currently active speaker.
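
The comparison of video and audio event times can be sketched in Python as follows. This is a simplified stand-in, not the measure used in the thesis: it scores each speaker by the fraction of audio onsets that have a motion-change event within a tolerance, and the names coincidence_score and active_speaker, the tolerance tol, and the event times are all hypothetical.

```python
import numpy as np

def coincidence_score(video_events, audio_onsets, tol=0.1):
    """Fraction of audio onsets with a video event within tol seconds."""
    video_events = np.asarray(video_events, dtype=float)
    audio_onsets = np.asarray(audio_onsets, dtype=float)
    if video_events.size == 0 or audio_onsets.size == 0:
        return 0.0
    # Distance from every audio onset to its nearest video event.
    gaps = np.abs(audio_onsets[:, None] - video_events[None, :]).min(axis=1)
    return float(np.mean(gaps <= tol))

def active_speaker(video_events_by_speaker, audio_onsets, tol=0.1):
    """Pick the speaker whose motion events best coincide with the audio."""
    scores = {
        name: coincidence_score(events, audio_onsets, tol)
        for name, events in video_events_by_speaker.items()
    }
    return max(scores, key=scores.get), scores

# Usage with made-up event times, in seconds.
audio_onsets = [1.0, 2.0, 3.0, 4.0]
video_events = {
    "A": [1.05, 2.02, 2.95, 4.08],  # motion changes close to the onsets
    "B": [0.5, 2.6, 3.5],           # motion changes unrelated to the onsets
}
winner, scores = active_speaker(video_events, audio_onsets)
```

In practice the score would be computed over a sliding time window so that the decision can follow changes of speaker during the scene.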

All the tools were tested in simulations as well as on real scenes recorded from TV.