|M.Sc Student||Barkai Saar|
|Subject||Distributed Deep Neural Networks|
|Department||Department of Electrical Engineering||Supervisor||Professor Assaf Schuster|
|Full Thesis text|
Cloud computing, which is becoming increasingly popular as a platform for distributed training of deep neural networks, is prone to straggler workers, which significantly increase training time in synchronous environments. Therefore, asynchronous distributed training is preferable in terms of training time when using cloud computing. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness, which is the main difficulty in scaling stochastic gradient descent to large clusters. We introduce a novel method for mitigating gradient staleness, which enables the use of large numbers of asynchronous workers. The method, named Gap-Aware, mitigates gradient staleness by reducing the size of incoming gradients based on a new, more accurate measure of their staleness, which we refer to as the Gap. Gap-Aware outperforms existing methods for gradient staleness penalization in terms of final accuracy and convergence rate. We show that Gap-Aware can be easily and naturally combined with DANA, a method that battles gradient staleness by estimating the model's future parameters. Our evaluation shows that combining both methods into a single algorithm, DANA-Gap-Aware (DANA-GA), produces state-of-the-art results in terms of final accuracy and convergence rate. The evaluation is performed on the well-known CIFAR, ImageNet, and WikiText-103 datasets.
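The core idea of penalizing gradients by their staleness can be sketched as follows. This is a minimal illustrative sketch, not the thesis algorithm itself: it assumes the Gap of each parameter is measured as the distance the master's parameters have moved since the worker read them, normalized by a running-average step size, and that each stale gradient is damped element-wise by this gap. The function name and the `avg_step` estimate are hypothetical.

```python
import numpy as np

def gap_aware_scale(gradient, master_params, stale_params, avg_step, eps=1e-8):
    """Damp a stale gradient by a per-parameter 'gap' estimate (sketch).

    gap = |master - stale copy| / average step size, i.e. roughly how many
    typical updates the master has applied since the worker read its copy.
    Gradients are divided by max(1, gap): an up-to-date gradient (gap <= 1)
    passes through unchanged, while a very stale one is shrunk.
    """
    gap = np.abs(master_params - stale_params) / (avg_step + eps)
    return gradient / np.maximum(1.0, gap)
```

A worker whose copy equals the master's current parameters has gap zero everywhere and its gradient is applied at full magnitude; a worker whose copy lags by many average-sized steps contributes a proportionally smaller update, which limits the damage from stale gradients without discarding them entirely.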