Ph.D Thesis

Ph.D Student: Rudoy Dmitry
Subject: Video Saliency and its Applications in Single and Multi-Camera Setups
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Lihi Zelnik-Manor


Understanding human attention has interested researchers for decades. Models of attention in static scenes have emerged and evolved into models of dynamic saliency. In parallel, there are extensive cinematographic theories on how a scene should be watched or filmed, and various methods propose how to place and move a camera in static and dynamic scenes.

The central contribution of this research is a novel approach to video saliency modelling. We propose a model that can effectively predict human attention in a given video. The model is learnt from human examples, so our second contribution is an effective method for large-scale collection of gaze data. We adapt our model to multiple-camera scenarios by proposing an approach for view selection based on fixed cameras. As our last contribution, we propose a method to shift human attention by inlaying artificial objects into a video.

Our model for video saliency is based on modelling gaze as attention shifts between consecutive video frames. This differs from analyzing each frame independently, as was often done before, and allows us to maintain temporal stability of the saliency maps. We incorporate static, motion and semantic features from the video to propagate a saliency map from one frame to the next.

To learn our saliency model, we propose a crowd-sourced method for recording human gaze tracks. The method can record gaze locations on any number of frames of any video. It requires no special equipment, and participants are not limited to any geography or culture.

For multiple-camera setups, we propose a method for efficient viewpoint selection from any set of cameras that view the same scene. Since placing a camera at a specified location usually requires knowledge of the 3D structure, our method works with fixed cameras. It ranks the cameras according to the visibility of the actions happening in the scene.

We further wish to edit the input video in order to shift human attention. To do so, we propose a user-friendly system for seamless inlaying of any 3D object into any video. We model the video as a single image, ask the user to add the object at the desired place, and then render it back into the video.

To verify the proposed methods, we test them on several known video datasets and on real-life videos. We compare our results quantitatively and qualitatively to state-of-the-art methods and outperform them.