Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Levine Nir
Subject: Rotting Bandits
Department: Department of Electrical Engineering
Supervisor: Professor Shie Mannor
Full Thesis Text: English Version


Abstract

The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation), one of the most fundamental trade-offs in stochastic decision theory. In the classical MAB problem, a decision maker must choose one arm at each time step from a fixed set of arms, and receives a reward for the chosen arm. The decision maker's objective is to maximize her cumulative expected reward or, equivalently, to minimize her regret over the time horizon. The MAB problem has been studied extensively, in particular under the assumption that the arms' reward distributions are stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we term Rotting Bandits, in which each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios, such as online advertising, content recommendation, portfolio management, and crowdsourcing. In this thesis we consider two cases of the Rotting Bandits problem: (1) there is no prior knowledge about the expected rewards, and (2) there is prior knowledge that each expected reward comprises a constant part and a rotting part, where the rotting part is known to belong to a set of rotting models. We present algorithms, accompanied by simulations, and derive theoretical guarantees.
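
To make the setting concrete, the following is a minimal sketch of a rotting environment. It is not the thesis's algorithm or its rotting models. The class `RottingArm`, the two policies, and the power-law decay `constant + scale * (n + 1) ** (-decay)` are all hypothetical choices made for illustration. The sketch only shows the defining property from the abstract: each arm's expected reward shrinks with that arm's own pull count. It compares a naive round-robin policy against a reference policy that is allowed to see the current expected rewards, which a real learner cannot do.

```python
import random


class RottingArm:
    """Arm whose expected reward decays with its own pull count.

    Hypothetical model: after n pulls the mean is
    constant + scale * (n + 1) ** (-decay).
    """

    def __init__(self, constant, scale, decay, noise_std=0.0):
        self.constant = constant
        self.scale = scale
        self.decay = decay
        self.noise_std = noise_std
        self.pulls = 0

    def expected_reward(self):
        # Mean reward of the next pull, given the pulls made so far.
        return self.constant + self.scale * (self.pulls + 1) ** (-self.decay)

    def pull(self, rng):
        # Draw a noisy reward around the current mean, then advance the count.
        mean = self.expected_reward()
        self.pulls += 1
        return mean + rng.gauss(0.0, self.noise_std)


def make_arms(noise_std=0.0):
    # Two illustrative arms: one starts high and rots slowly toward a low
    # constant, the other starts lower but settles at a higher constant.
    return [
        RottingArm(constant=0.2, scale=1.0, decay=0.5, noise_std=noise_std),
        RottingArm(constant=0.5, scale=0.3, decay=1.0, noise_std=noise_std),
    ]


def run_round_robin(arms, horizon, rng):
    # Baseline: cycle through the arms regardless of observed rewards.
    total = 0.0
    for t in range(horizon):
        total += arms[t % len(arms)].pull(rng)
    return total


def run_oracle_greedy(arms, horizon, rng):
    # Reference policy: always pull the arm with the highest current mean.
    # It reads the true means, so it is a benchmark, not a learning algorithm.
    total = 0.0
    for _ in range(horizon):
        best = max(arms, key=lambda arm: arm.expected_reward())
        total += best.pull(rng)
    return total


if __name__ == "__main__":
    horizon = 200
    oracle = run_oracle_greedy(make_arms(), horizon, random.Random(0))
    naive = run_round_robin(make_arms(), horizon, random.Random(0))
    print("regret of round-robin vs. oracle:", oracle - naive)
```

With `noise_std=0.0` the rewards equal their means, so the printed difference is the round-robin policy's regret relative to the oracle reference on this particular toy instance. Setting `noise_std` above zero gives a noisy version in which a learner would have to estimate the decaying means from samples.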