M.Sc Thesis


M.Sc Student: Sadeh Kenigsfield Gal
Subject: Leveraging Auxiliary Text for Deep Recognition of Unseen Visual Relationships
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Ran El-Yaniv
Full Thesis Text: English Version


Abstract

An important challenge in computer vision is \emph{scene understanding}; one aspect of understanding a scene is the recognition of interactions between objects. This task -- often called visual relationship detection (VRD) -- must be solved to enable a higher-level understanding of the semantic content of images.

VRD becomes particularly hard when some of the potentially involved objects are statistically sparse and many relationships appear only rarely in standard training sets. A recent trend in scene understanding is the fusion of different signals, e.g., audio and text, to enhance modern computer vision systems. The motivation behind this trend originates in neuroscience, where research has consistently shown that biological systems fuse information from different senses in order to infer the current scene. For example, watching a movie without sound is a lesser experience for the viewer, even if subtitles are provided.

In this research, we show how to transduce auxiliary text so as to enable recognition of relationships that are absent from the visual training data and to enhance the capabilities of current VRD systems.

This transduction is performed by learning a shared relationship representation for both the textual and visual information.
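As a minimal illustrative sketch (the encoders $f_{\mathrm{vis}}$, $f_{\mathrm{txt}}$ and projections $\phi$, $\psi$ below are assumed notation, not the exact formulation used in the thesis), such a shared space can be obtained by aligning the visual and textual embeddings of the same relationship triplet $(s, p, o)$:
\begin{equation*}
\mathcal{L}_{\mathrm{align}} = \sum_{(s,p,o)} \big\| \phi\big(f_{\mathrm{vis}}(s,p,o)\big) - \psi\big(f_{\mathrm{txt}}(s,p,o)\big) \big\|_2^2 .
\end{equation*}
Under this view, a relationship that appears only in the auxiliary text still receives an embedding in the shared space, so its visual instances can be matched against that embedding at test time.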

The proposed approach is \emph{model-agnostic} and can be used as a plug-in module in existing VRD and \emph{scene graph generation} (SGG) systems to improve their performance and extend their capabilities.

We consider the application of our technique using three widely used SGG models \cite{xu2017scene,zellers2018scenegraphs,tang2019learning} and different auxiliary text sources: image captions, text generated by a deep text generation model (GPT-2), and ebooks from Project Gutenberg.

We conduct an extensive empirical study of both the VRD and SGG tasks over large-scale benchmark datasets.

Our method is the first to enable recognition of visual relationships missing in the visual training data and appearing only in the auxiliary text.

We conclusively show that text ingestion enables recognition of unseen visual relationships and, moreover, advances the state of the art on all SGG tasks.