M.Sc Thesis


M.Sc StudentLeibovitch Omer
SubjectSparse Neural Networks: Theory and Practice
DepartmentDepartment of Computer Science
Supervisor ASSOCIATE PROF. Nir Ailon


Abstract

This thesis studies neural networks with a property of sparsity. Such property can be obtained by pre-specifying the non-zero weights of the network, but also by the process of pruning. The contributions include both an experimental (practical) part, where we show competitive results using these sparse models, and a theoretical part, where we study and attempt to understand the optimization problems posed by such neural networks, and try to shed light on their behavior.

The first part of the thesis focuses on butterfly networks. A butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on FJLT with high probability approximates the action of any matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network with an architecture based on the butterfly network. The proposed architecture significantly improves upon the quadratic number of weights required in a standard dense layer to nearly linear with little compromise in expressibility of the resulting operator. In a collection of wide variety of experiments, including supervised prediction on both the NLP and vision data, we show that this not only produces results that match and often outperform existing well-known architectures, but it also offers faster training and prediction in deployment.

The second part of this thesis focuses on pruning of neural networks, also known as compression or sparsification, which is the task of converting a given network, which may be too expensive to use (in prediction) on low resource platforms, with another ’lean’ network which performs almost as well as the original one, while using considerably fewer resources. By turning the compression ratio knob, the practitioner can trade off the information gain versus the necessary computational resources, where information gain is a measure of reduction of uncertainty in the prediction. In certain cases, however, the practitioner may readily possess some information on the prediction from other sources. The main question we study in this part is, whether it is possible to take advantage of the additional side information, in order to further reduce the computational resources, in tandem with the pruning process?

Motivated by a real-world application, we distill an elegantly stated problem, which is the task of shared-label prediction.

We propose various methods for performing this task, and compare them using extensive experiments on public benchmark data sets for image classification. Our comparison is based on measures of relative information (RI) and n-accuracy, which we define. Interestingly, we empirically find that i) sharing information between the n instances, using a carefully crafted LSTM architecture performs best, among all methods we experiment with, ii) for all methods studied, we exhibit a sweet spot phenomenon, which sheds light on the compression-information trade-off and may assist a practitioner to choose the desired compression ratio.