M.Sc Thesis

M.Sc Student: Aviv Rotem
Subject: Learning under Delayed Feedback: Implicitly Adapting to Gradient Delays
Department: Department of Electrical and Computer Engineering
Supervisor: Dr. Yehuda Levy
Full thesis text: English Version


We consider constrained stochastic convex optimization problems, which capture fundamental learning problems such as linear regression, logistic regression, and support-vector machines. The complexity of current machine learning models and the vast amount of training data they require lead to impractically long training times. Employing several machines asynchronously in parallel, sharing a common memory, has the potential to greatly decrease the training duration. Such settings arise naturally when training machine learning algorithms on shared-resource computational systems such as clouds and data centers. These scalable systems are central to the discipline of distributed learning, where more and more computation is carried out on remote machines. However beneficial, asynchronous updates introduce delayed feedback, which degrades the performance of machine learning algorithms.

Recent papers have shown that it is possible to obtain the optimal rates for stochastic gradient descent with delayed feedback for smooth convex objectives. Unfortunately, the suggested methods suffer from several drawbacks: they do not apply to constrained problems; their guarantees degrade with the maximal delay, which might be substantially higher than the average delay; they cannot accommodate changes in the delays throughout the training process; and they require prior knowledge of the maximal delay and the smoothness parameter. Since delays frequently change in shared-resource computational environments due to dynamic allocation of machines, these limitations render such methods unsuitable for scalable systems.

Conversely, we propose a robust and adaptive training method for this problem with arbitrary time-varying delays, and we showcase its advantages both theoretically and in practice. We derive non-asymptotic convergence guarantees that do not depend on prior knowledge of the feedback delays, the objective's smoothness, or the gradient noise variance. More concretely, we prove convergence bounds for both convex and strongly convex objectives. Our method depends on the average delay rather than the maximal one, and it implicitly adapts to changes in the delays: we neither assume prior knowledge of the delays nor incorporate them into our model, yet we obtain the optimal delay dependency. Consequently, one can tune a baseline algorithm for a given model without delays and then apply it to the delayed setting without further tuning, while still achieving the optimal delay dependency. In addition, for the standard non-delayed strongly convex setting, we develop the first adaptive stochastic gradient descent variant that does not depend on the objective's smoothness or the gradient noise variance. We demonstrate experimentally the scalability, adaptivity, and robustness of our method by comparing it with previously suggested methods, confirming that it outperforms them and enables features that were otherwise unavailable.
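To make the delayed-feedback setting concrete, the following is a minimal generic sketch (not the thesis's algorithm) of projected stochastic gradient descent where each gradient is computed at an older iterate and only applied after an arbitrary, time-varying delay. The objective, constraint set, step size, and delay distribution are all hypothetical choices for illustration: a one-dimensional quadratic f(x) = (x - 3)^2 constrained to [0, 2], whose constrained minimizer is x = 2.

```python
import random

def delayed_sgd(steps=2000, max_delay=5, lr=0.05, seed=0):
    """Projected SGD on f(x) = (x - 3)^2 over [0, 2] with delayed gradients.

    Illustrative sketch only: gradients are queued with a random delay of
    0..max_delay steps, so each update may use a stale gradient.
    """
    rng = random.Random(seed)
    x = 0.0
    pending = []  # list of (arrival_step, stale_gradient)
    for t in range(steps):
        # Noisy gradient computed at the current iterate, delivered later.
        grad = 2.0 * (x - 3.0) + rng.gauss(0.0, 0.1)
        delay = rng.randint(0, max_delay)  # arbitrary time-varying delay
        pending.append((t + delay, grad))
        # Apply every gradient whose delay has elapsed by step t.
        arrived = [g for (due, g) in pending if due <= t]
        pending = [(due, g) for (due, g) in pending if due > t]
        for g in arrived:
            x -= lr * g
            x = min(max(x, 0.0), 2.0)  # projection onto the constraint set
    return x

print(delayed_sgd())  # settles at the constrained optimum x = 2
```

The queue of pending gradients models an asynchronous shared-memory system: by the time a worker's gradient is applied, the iterate has already moved, which is precisely the staleness that the thesis's guarantees account for through the average delay.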