|M.Sc Student||Cassel Asaf Benjamin|
|Subject||A General Approach to Multi-Armed Bandits Under Risk|
|Department||Department of Electrical Engineering||Supervisor||Professor Shie Mannor|
|Full Thesis text|
Different risk-related criteria have received recent interest in learning problems. Previous works typically viewed criteria as unrelated and analyzed each one in a customized manner. This resulted in a long and arduous process which burdened the introduction of new criteria. In this work we provide a more systematic approach to analyzing risk criteria within a stochastic multi-armed bandit (MAB) formulation. We consider the class of empirical distribution based criteria, which measure variation within the sampled reward sequence, and identify key properties of its members. This results in explicit conditions under which we provide a simple characterization of the oracle rule (which serves as the regret benchmark), and design an upper confidence bound (UCB) learning policy, which incurs logarithmic regret. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. An additional minor result includes the analysis of an intermediate leaning goal that measures variation in overall performance between different trials. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.