|M.Sc. Student||Adelman Menachem|
|Subject||Faster Neural Network Training with Approximate Tensor Operations|
|Department||Department of Electrical and Computer Engineering||Supervisor||Associate Prof. Mark Silberstein|
Approximation techniques for faster inference and training of deep neural networks (DNNs) have received considerable attention. Sampling-based approximations have been used to accelerate inference, but their use in training has been neither systematically studied nor shown to deliver end-to-end GPU performance benefits in practice.
We propose a novel approach to accelerating DNN training by systematically approximating tensor operations via sampling. At a high level, the original matrix products and convolutions are replaced with their faster approximate versions. The approximation is applied separately to each tensor operation, keeping the network architecture and tensor dimensions intact, thereby facilitating the adoption of this technique in existing DNN training frameworks, potentially in combination with other approximation techniques. Furthermore, when combined with distributed training, our technique allows for a seamless reduction in communication bandwidth and additional performance gains.
We compare several known algorithms for approximate matrix multiplication and find column-row sampling (CRS) to be the most suitable for use in training.
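To make the CRS idea concrete, the following is a minimal pure-Python sketch: the product A @ B is estimated by sampling k column-row pairs, with pair i drawn with probability proportional to the product of its column and row norms (the variance-optimal policy) and each sampled outer product rescaled so the estimate is unbiased. The function name and list-of-lists representation are illustrative, not the thesis implementation.

```python
import math
import random

def crs_matmul(A, B, k, rng=random):
    """Approximate A @ B (A: m x n, B: n x p) by sampling k column-row pairs.

    Pair i is drawn with probability p_i proportional to |A[:, i]| * |B[i, :]|,
    and each sampled outer product is scaled by 1 / (k * p_i), which makes the
    estimator unbiased: E[C] = A @ B.
    """
    m, n, p = len(A), len(B), len(B[0])
    # Sampling probabilities: product of column norm of A and row norm of B.
    norms = [math.sqrt(sum(A[r][i] ** 2 for r in range(m))) *
             math.sqrt(sum(b ** 2 for b in B[i])) for i in range(n)]
    total = sum(norms)
    probs = [x / total for x in norms]
    C = [[0.0] * p for _ in range(m)]
    for _ in range(k):
        i = rng.choices(range(n), weights=probs)[0]
        scale = 1.0 / (k * probs[i])
        for r in range(m):
            for c in range(p):
                C[r][c] += scale * A[r][i] * B[i][c]
    return C
```

Only k of the n outer products are computed, so the cost drops from O(mnp) to O(kmp); when n = 1 the single pair is sampled every time and the product is recovered exactly.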
We study the application of CRS to linear regression and observe that the resulting gradients differ from those of exact training due to the dependence between sampled features. To address this, we propose a new Bernoulli-CRS variant which achieves statistical independence of samples. We study the theoretical guarantees for Bernoulli-CRS and find that it provides a better approximation in certain cases. We show that using Bernoulli sampling in linear regression is equivalent to dynamic L2 weight regularization of the original, non-approximate loss.
We then study a new TopK-CRS variant which deterministically selects the k column-row pairs with the highest norms. We show that this algorithm is equivalent to the minimum mean square error (MMSE) estimator when the column-row pairs are pairwise independent with zero mean.
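The deterministic variant is simpler still: rank the pairs by their norm product and keep the top k, with no sampling and no probabilistic rescaling. A sketch of that selection rule (illustrative, not the thesis implementation):

```python
import math

def topk_crs_matmul(A, B, k):
    """Approximate A @ B with the k column-row pairs of largest norm product.

    Deterministic: the k pairs with the highest |A[:, i]| * |B[i, :]| are
    selected and their outer products summed as-is.
    """
    m, n, p = len(A), len(B), len(B[0])
    norms = [math.sqrt(sum(A[r][i] ** 2 for r in range(m))) *
             math.sqrt(sum(b ** 2 for b in B[i])) for i in range(n)]
    # Indices of the k largest norm products.
    top = sorted(range(n), key=lambda i: norms[i], reverse=True)[:k]
    C = [[0.0] * p for _ in range(m)]
    for i in top:
        for r in range(m):
            for c in range(p):
                C[r][c] += A[r][i] * B[i][c]
    return C
```

Determinism removes the sampling variance, at the cost of a bias toward the high-norm pairs; the MMSE equivalence stated above holds under the pairwise-independence and zero-mean assumptions.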
We generalize matrix product approximation to convolutions and analyze the approximation error to derive the optimal sampling policy. This allows us to apply approximations to training of convolutional neural networks (CNNs).
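A convolution over C_in input channels is a sum of C_in single-channel convolutions, so CRS extends to convolutions by sampling input channels. A minimal 1-D, single-output-channel, no-padding sketch under those simplifying assumptions (the function name and layout are illustrative):

```python
import math
import random

def crs_conv1d(x, w, k, rng=random):
    """Approximate a multi-channel 1-D convolution by sampling input channels.

    x is C_in x L (input), w is C_in x K (filter, one output channel).
    Channel i is drawn with probability p_i proportional to |x[i]| * |w[i]|
    and its per-channel convolution is scaled by 1 / (k * p_i), mirroring
    the CRS estimator for matrix products.
    """
    c_in, L, K = len(x), len(x[0]), len(w[0])
    norms = [math.sqrt(sum(v * v for v in x[i])) *
             math.sqrt(sum(v * v for v in w[i])) for i in range(c_in)]
    total = sum(norms)
    probs = [v / total for v in norms]
    out = [0.0] * (L - K + 1)
    for _ in range(k):
        i = rng.choices(range(c_in), weights=probs)[0]
        scale = 1.0 / (k * probs[i])
        for t in range(L - K + 1):
            out[t] += scale * sum(x[i][t + j] * w[i][j] for j in range(K))
    return out
```

Only k of the C_in per-channel convolutions are evaluated, which is the source of the FLOP savings in CNN training.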
We consider two approximation approaches to DNN training: one where approximate computations are used in both forward and backward passes, and another limited to approximating computations in the backward pass only.
We analyze backward-only sampling and show that, under certain conditions, it provides the same convergence guarantees as exact SGD. However, we find in practice that forward-pass approximations lead to better model accuracy and larger computational savings.
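In the backward-only setting, the forward pass stays exact and only the gradient computation is sampled. For a linear layer, the weight gradient X^T @ dY is itself a matrix product whose column-row pairs correspond to batch examples, so CRS applies directly to it. A sketch under that framing, with illustrative names:

```python
import math
import random

def linear_forward(X, W):
    """Exact forward pass: Y = X @ W (X: batch x d_in, W: d_in x d_out)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))] for x in X]

def approx_weight_grad(X, dY, k, rng=random):
    """Backward-only approximation of dW = X^T @ dY via CRS.

    Each column-row pair of X^T @ dY is one batch example, so example b is
    sampled with probability proportional to |x_b| * |dy_b| and rescaled by
    1 / (k * p_b), giving an unbiased estimate of the weight gradient while
    the forward pass remains exact.
    """
    B, d_in, d_out = len(X), len(X[0]), len(dY[0])
    norms = [math.sqrt(sum(v * v for v in X[b])) *
             math.sqrt(sum(v * v for v in dY[b])) for b in range(B)]
    total = sum(norms)
    probs = [v / total for v in norms]
    dW = [[0.0] * d_out for _ in range(d_in)]
    for _ in range(k):
        b = rng.choices(range(B), weights=probs)[0]
        scale = 1.0 / (k * probs[b])
        for i in range(d_in):
            for j in range(d_out):
                dW[i][j] += scale * X[b][i] * dY[b][j]
    return dW
```

Because the estimate is unbiased, this slots into SGD as a stochastic gradient with extra variance, which is what the convergence analysis above exploits; in a framework such as PyTorch the same idea would live in a custom backward function.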
We implement our techniques in PyTorch and evaluate them on several DNN topologies, including MLP and CNN networks on MNIST, Wide ResNet 28-10 on CIFAR-10 and ResNet-50 and ResNet-152 on ImageNet. We demonstrate up to 66% reduction in the number of FLOPs and up to 1.33x faster training time with little or no degradation in model accuracy.
We develop another flavor of TopK-CRS that reduces the amount of gradient communication between workers. Notably, our algorithm is compatible with the standard AllReduce approach used in distributed deep learning. We implement an AllReduce scheme that takes advantage of the smaller gradient footprint and demonstrate a 1.09x-1.37x speedup in multi-node training.
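The compatibility with AllReduce hinges on all workers reducing the same coordinates, so the reduced gradient fits in a dense buffer of size k. The following in-process simulation illustrates that requirement only; the index-agreement rule (summing per-coordinate scores across workers) and all names are assumptions for illustration, not the thesis scheme:

```python
def compact_allreduce(grads, k):
    """Simulate a top-k AllReduce in which all workers reduce the same indices.

    grads is a list of per-worker gradient vectors. Per-coordinate scores
    (here, squared values summed across workers) determine a single shared
    top-k index set; only those k coordinates are then averaged, so a
    standard dense AllReduce over a k-sized buffer suffices.
    """
    n_workers, n = len(grads), len(grads[0])
    scores = [sum(g[i] ** 2 for g in grads) for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    reduced = [0.0] * n
    for i in top:
        reduced[i] = sum(g[i] for g in grads) / n_workers
    return reduced
```

Since every worker reduces the identical k-entry set, no index exchange or sparse-gather step is needed, which is what keeps the scheme compatible with stock AllReduce implementations.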