Ph.D. Thesis

Ph.D. Student: Rott Shaham Tamar Lea
Subject: Semantics Aware Image Processing
Department: Department of Electrical and Computer Engineering
Supervisor: Associate Prof. Tomer Michaeli
Full thesis text: English version


Algorithms for processing visual data are ubiquitous in many fields of science, with applications ranging from signal recovery (e.g., removing noise or blur, increasing resolution) to recognition (e.g., detecting objects within an image). A key ingredient in most such algorithms is a measure of similarity between images, which is often used as an optimization criterion. In the first part of this research we present a new deformation-aware image similarity criterion. As opposed to both traditional fidelity measures (e.g., the sum of squared differences) and more recent approaches, our criterion is insensitive to small local translations, and thus regards two images as similar even if they differ by such deformations. We demonstrate our approach in the context of image compression. As we show, optimal compression under our new criterion boils down to slightly deforming the input image so that it becomes more "compressible". We also illustrate how deformation indifference can be harnessed to explore the geometric preferences of image priors, which play a fundamental role in many low-level vision tasks. Our algorithm reveals intriguing properties of popular image models which had not been noticed in the past.
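The translation-insensitive idea can be sketched in a few lines: instead of comparing two images pixel-wise, compare them after minimizing over a small set of shifts. The sketch below uses a single global integer shift for brevity, whereas the criterion developed in the thesis handles local deformations; the function name and parameters are illustrative only.

```python
import numpy as np

def deformation_aware_ssd(a, b, radius=2):
    """Illustrative sketch: SSD between images `a` and `b`, minimized
    over small global integer translations of `b` (up to `radius`
    pixels in each direction). The thesis criterion allows *local*
    translations; a single global shift keeps the example short."""
    best = np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shift b by (dy, dx) with wrap-around and compare to a
            shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
            ssd = np.sum((a.astype(float) - shifted.astype(float)) ** 2)
            best = min(best, ssd)
    return best

# A small shift of the same image yields zero deformation-aware
# distance, while plain SSD would report a large error.
img = np.zeros((8, 8)); img[3, 3] = 1.0
shifted = np.roll(img, 1, axis=1)
print(deformation_aware_ssd(img, shifted, radius=2))  # 0.0
```

Note how the plain SSD between `img` and `shifted` is nonzero, while the deformation-aware distance vanishes: exactly the insensitivity to small translations described above.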

We next take the deformation insensitivity idea a step further, and develop a technique that not only allows for deformations, but can also completely alter the image while preserving its semantics. In recent years, the idea of manipulating an image based on its semantics has been extensively explored, mainly by exploiting Generative Adversarial Networks (GANs). However, generation of high-quality complex scenes requires a very large training set. Here, we introduce SinGAN, an unconditional generative model that can be learned from only a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high-quality, diverse samples which semantically resemble the training image, yet contain new object configurations and structures. Furthermore, once trained on a single image, SinGAN can be used within a simple unified framework to solve a variety of tasks, including paint-to-image, editing, harmonization, super-resolution, and animation from a single image.
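A minimal illustration of the "internal patch distribution" mentioned above: the set of all overlapping patches of the single training image. This is only a conceptual sketch; the function name and sizes are assumptions, and SinGAN learns this distribution adversarially rather than by enumerating patches explicitly.

```python
import numpy as np

def extract_patches(img, size=5):
    """Collect all overlapping size x size patches of a single image.
    A model trained on one image only ever sees these patches, so its
    samples are built from recombinations of them. Names here are
    illustrative, not taken from the SinGAN codebase."""
    h, w = img.shape[:2]
    patches = [img[i:i + size, j:j + size]
               for i in range(h - size + 1)
               for j in range(w - size + 1)]
    return np.stack(patches)

# An 8x8 image yields a 4x4 grid of 5x5 patches: 16 samples
# drawn from a single training image.
patches = extract_patches(np.arange(64.).reshape(8, 8), size=5)
print(patches.shape)  # (16, 5, 5)
```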

Training a GAN on a single image holds an additional benefit: it requires only a lightweight architecture with a short forward time. However, it is necessary to train a new model for each specific input image at inference time, and therefore the framework is unsuitable for real-time applications. Networks trained on a large dataset generalize well to unseen images. However, this comes at the cost of huge models with typically tens of millions of parameters, and long inference time. Thus, external models cannot be used in real-time applications either. In the last part of our research we propose a new framework that aims to combine the best of both worlds. We show how to make an external GAN model adaptive to each specific input it processes at test time, and illustrate that combining this with a novel lightweight architecture significantly reduces the runtime of image translation networks.
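As a toy illustration of making an externally trained model adaptive to its test input, the sketch below fine-tunes a single pretrained weight with a few gradient steps on a loss defined by that input alone. The scalar model and the self-reconstruction loss are purely illustrative assumptions, not the architecture proposed in the thesis.

```python
import numpy as np

def adapt_to_input(w, x, steps=200, lr=0.1):
    """Hypothetical sketch of test-time adaptation: starting from an
    externally pretrained weight `w`, take a few gradient steps on a
    self-reconstruction loss defined by the specific test input `x`
    before producing the final output."""
    for _ in range(steps):
        # gradient of the MSE between the model output w*x and x itself
        grad = 2.0 * np.mean((w * x - x) * x)
        w = w - lr * grad
    return w

# The weight specializes to this particular input at test time:
rng = np.random.RandomState(0)
x = rng.randn(100)
print(adapt_to_input(0.2, x))  # converges toward 1.0
```

The point of the sketch is the workflow, not the model: a generic pretrained parameter is cheaply specialized to the one input being processed, which is the "best of both worlds" combination described above.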