|Ph.D Student||David Yahel|
|Subject||Learning in Large Multi-Armed Bandits: Models,|
Algorithms and Bounds
|Department||Department of Electrical Engineering||Supervisor||Professor Nahum Shimkin|
|Full Thesis text|
The Multi-Armed Bandit (MAB) problem is a classic instance of sequential decision making. In the MAB problem, the learning agent faces a set of stochastic arms, from which it chooses arms sequentially. In each round, a stochastic reward that depends on the selected arm is obtained. The goal of the learning agent is to maximize the total obtained reward under the regret framework, or to find the best arm, for a given criterion, under the PAC framework.
The power of the MAB problem is in its ability to capture many recent applications such as recommendation and routing problems, and to provide a generic approach for solving them. In particular, it captures the fundamental exploration vs. exploitation dilemma.
One of the drawbacks of the basic MAB problem and its solution is in theirs sensitivity to the number of arms. More precisely, in the classical MAB problem, the regret and the sample complexity, which are common performance measures for MAB algorithms are linear in the number of arms.
In this thesis we consider models of realistic scenarios that involve a large number of arms. We study and propose MAB variants that capture those scenarios. We develop lower bounds and algorithms for that MAB variants, for some under the regret framework and for the other under the PAC framework.
This thesis consists of seven parts, in the second and third parts, we consider the Infinetely Many-Armed Bandit (I-MAB) problem, in which the number of arms is infinite. In this problem there is no known structure, but there are some statistical assumptions related to the mean rewards of the arms. We study the classical I-MAB problem as well as some generalizations thereof, motivated by recent applications.
In the fourth and fifth parts, we study the Max-Bandit and the Max-Quantile Bandit problems. Those problems are generalizations of the I-MAB problem. However each of them has its own concrete motivation.
In the sixth part of this thesis, we consider the scenario in which there is some known coupling between the mean rewards of the arms. So, statistics obtained for one arm can be meaningful for others as well. We develop an algorithm that uses that structure for sampling suboptimal arms less often than otherwise.