Ph.D Thesis

Ph.D Student: Kligvasser Idan
Subject: Efficient Architectures and Training Methods for Deep Neural
Department: Department of Electrical and Computer Engineering
Supervisor: Associate Prof. Tomer Michaeli


Convolutional neural networks (CNNs) have led to significant advances in computer vision, enabling remarkable accuracy in both high-level and low-level tasks. The performance of CNN models has improved steadily over the last few years, largely due to novel architectures that allow stable training of ever deeper nets. As a consequence, the impressive gain in accuracy has been accompanied by a tremendous increase in memory footprint and computational burden. This trend has reached the point where, in many computer vision tasks, the current state-of-the-art nets have hundreds of layers with tens of millions of parameters and are thus impractical for deployment on mobile or wearable devices. In this research we propose several mechanisms for improving CNN performance. Rather than increasing depth, we focus on making the ingredients, architectures and optimization processes more efficient.

In the first part of this research, we focus on making the nonlinear activations more effective. Most popular architectures use per-pixel activation units, e.g. rectified linear units (ReLUs) or exponential linear units (ELUs). We propose to replace those units with the xUnit, a layer with spatial and learnable connections. The xUnit computes a continuous-valued weight map, which serves as a soft gate to its input. As we show, although it is more computationally demanding and memory consuming than a per-pixel unit, the xUnit has a dramatic effect on the net's performance. It thus makes it possible to reach the same performance with significantly fewer layers, offering an overall substantial reduction in the number of parameters.
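To make the idea of a spatial soft gate concrete, the following is a minimal NumPy sketch, not the thesis's exact layer. It assumes the weight map is obtained by passing the rectified input through a per-channel (depthwise) convolution and then through a Gaussian-shaped function, so that every gate value lies in (0, 1]. The function names, the kernel size and the Gaussian form are assumptions made for illustration; normalization layers are omitted. For comparison, a ReLU can be viewed as a hard, per-pixel gate that multiplies its input by either 0 or 1.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Per-channel 2-D correlation with zero padding ('same' output size).

    x: array of shape (C, H, W); kernels: array of shape (C, k, k), k odd.
    """
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) for all channels.
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def soft_gate_activation(x, kernels):
    """Illustrative spatial soft gate (assumed form, see the text above).

    Returns the gated output and the weight map itself.
    """
    d = depthwise_conv2d(np.maximum(x, 0.0), kernels)  # spatial, learnable part
    gate = np.exp(-d ** 2)                             # assumed gate: values in (0, 1]
    return gate * x, gate

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
kernels = 0.1 * rng.standard_normal((3, 3, 3))  # would be learned in practice
y, gate = soft_gate_activation(x, kernels)
```

Unlike a per-pixel unit, each gate value here depends on a spatial neighborhood of the input, which is where the extra computation and memory of such a layer come from.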

In the second part of this research, we move on to improving the optimization process of generative adversarial networks (GANs). Although GANs can generate photo-realistic samples of very high quality, they are often hard to train and require careful use of regularization and/or normalization methods to make the training stable and effective. We begin by analyzing the popular spectral normalization scheme, identify a significant drawback, and introduce sparsity aware normalization (SAN), a new alternative approach for stabilizing GAN training. As opposed to other normalization methods, our approach explicitly accounts for the sparse nature of the feature maps in convolutional networks with ReLU activations. As we show, sparsity is particularly dominant in critics used for image-to-image translation settings. In these cases our approach improves upon existing methods, while requiring practically no computational overhead.
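For reference, the baseline analyzed here, spectral normalization, divides a layer's weight matrix by its largest singular value, which is commonly estimated with power iteration. The sketch below illustrates only that baseline on a plain matrix; it is not SAN, and the function names, iteration count and seed are assumptions chosen for illustration.

```python
import numpy as np

def spectral_norm_estimate(w, n_iters=50, seed=0):
    """Estimate the largest singular value of matrix w by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = w.T @ u              # move toward the top right singular vector
        v /= np.linalg.norm(v)
        u = w @ v                # move toward the top left singular vector
        u /= np.linalg.norm(u)
    return float(u @ w @ v)      # u^T W v approximates sigma_max(W)

def spectrally_normalize(w, n_iters=50):
    """Rescale w so that its largest singular value is approximately 1."""
    return w / spectral_norm_estimate(w, n_iters)

# Toy weight matrix with singular values 3 and 1.
w = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
sigma = spectral_norm_estimate(w)
w_sn = spectrally_normalize(w)
```

In practice, implementations usually run a single power-iteration step per training update and reuse the vector `u` between updates, rather than iterating to convergence as this toy example does.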

In the last part of this research, we simplify the learning of perceptual image-restoration networks by avoiding adversarial training. We propose to exploit a surprisingly dominant property of deep features: the dissimilarity between their internal distributions at different image scales. As opposed to small patches in pixel space, which are known to exhibit similar characteristics across scales, it turns out that the internal distribution of deep features varies distinctively between scales. We show how this deep self dissimilarity (DSD) property can be used as a powerful visual fingerprint. In particular, we illustrate that image quality measures derived from DSD are highly correlated with human preference. Furthermore, incorporating DSD as a loss function in the training of image restoration networks leads to results that are at least as photo-realistic as those obtained by GAN-based methods, without requiring adversarial training.