M.Sc Student: Ben Uri Liad
Subject: Improving Efficiency of DNN Training using Stochastic
Department: Department of Electrical and Computer Engineering
Supervisor: Associate Prof. Daniel Soudry
In recent years, deep learning has influenced many fields of study, and its influence continues to grow. Tackling real-world problems and building ever more general and robust capabilities has required larger and larger datasets, paired with deeper and more complex architectures (modern NLP models, for example, reach billions of parameters). This combination has pushed training times to thousands of GPU-hours for a single model.

In this research, we alleviate one of the major bottlenecks of training by compressing the neural gradients. Most existing neural network compression methods (e.g., pruning or quantization) focus on weights, activations, and weight gradients. We show that these methods are not suitable for compressing neural gradients, which have a very different distribution. Specifically, we find that the neural gradients follow a lognormal distribution. Taking this into account, we propose stochastic pruning to reduce the computational and memory burdens of the neural gradients. Using the lognormal model, we derive a theoretical method for finding the threshold that yields a desired sparsity level under stochastic pruning, given only the distributional parameters of the tensor.

We then evaluate the performance of the algorithm and observe that different layers are affected differently by the same level of pruning. We therefore measure the effect of the cosine similarity between the pruned and the original tensor on the final accuracy, and derive a similar theoretical framework to control it and limit the damage the pruning process inflicts on each layer. Using these theoretically derived measures, we formulate a new algorithm that heterogeneously allocates different sparsity levels across the layers, better preserving the cosine similarity and achieving higher sparsity levels while maintaining the baseline accuracy.
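To make the two central ideas above concrete, the sketch below illustrates (a) stochastic pruning in its common unbiased form, where an entry with magnitude below the threshold is rounded up to the threshold with probability proportional to its magnitude and to zero otherwise, and (b) one way a pruning threshold could be derived from lognormal distribution parameters so that a target sparsity is met in expectation. This is an illustrative reconstruction under those assumptions, not the thesis implementation; the function names and the `expected_sparsity` formula (a lognormal partial-expectation identity) are mine, and the thesis's actual derivation may differ.

```python
import math

import numpy as np


def stochastic_prune(x, theta, rng=None):
    """Stochastically prune entries of x with magnitude below theta.

    Entries with |x_i| >= theta are kept unchanged; entries with
    |x_i| < theta become sign(x_i) * theta with probability |x_i| / theta
    and zero otherwise, so the result is an unbiased estimate of x.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    small = np.abs(x) < theta
    promote = rng.random(x.shape) < np.abs(x) / theta
    return np.where(small, np.where(promote, np.sign(x) * theta, 0.0), x)


def _phi(z):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_sparsity(theta, mu, sigma):
    # Expected fraction of entries pruned to exactly zero when the gradient
    # magnitudes follow Lognormal(mu, sigma^2):
    #   S(theta) = F(theta) - (1/theta) * E[|x| * 1{|x| < theta}],
    # using E[|x| * 1{|x| < theta}] = exp(mu + sigma^2/2) * Phi(t - sigma).
    t = (math.log(theta) - mu) / sigma
    partial_mean = math.exp(mu + 0.5 * sigma ** 2) * _phi(t - sigma)
    return _phi(t) - partial_mean / theta


def threshold_for_sparsity(target, mu, sigma, lo=1e-12, hi=1e12, iters=200):
    # Bisection in log-space: expected_sparsity increases monotonically
    # with theta, from 0 toward 1, so a simple bracket search suffices.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if expected_sparsity(mid, mu, sigma) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

Because the rounding rule is unbiased, the pruned tensor matches the original in expectation, and the closed-form `expected_sparsity` lets the threshold be computed from the fitted (mu, sigma) alone, without sorting the tensor's entries.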
We experiment broadly with the proposed pruning methods and quantify their ability to accelerate the training of different architectures on diverse datasets.