M.Sc. Student: Amit Wolfenfeld
Subject: Dueling Bandits Black Boxes
Department: Department of Electrical Engineering
Supervisor: Professor Nir Ailon
In machine learning, the notion of multi-armed bandits refers to a class of online learning problems in which a learner (also called a decision maker or agent) explores and exploits a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the learner learns from stochastic feedback in the form of real-valued rewards. We study a partial-information online learning problem in which actions are restricted to noisy comparisons between pairs of alternatives - the Dueling Bandits Problem. As opposed to conventional approaches, which require the real-valued reward of each chosen arm to be numerical and observable, the Dueling Bandits setting only assumes that (noisy) stochastic feedback about the relative reward of two chosen arms is readily available. This type of relative feedback, or pairwise preference, is particularly fitting in cases where numerical or absolute rewards are difficult to quantify (for instance, user preferences over search-engine result sets, or preferences in taste), but where pairwise comparisons are easy to elicit. In this work we propose several new methods for the Dueling Bandits Problem. Our approach extends the Doubler and Sparring algorithms. We show empirical results using real data from Microsoft Research’s LETOR project.
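To make the feedback model concrete, the following is a minimal sketch (not the thesis's actual formulation) of a dueling bandit environment: each arm has a hidden utility, and a "duel" between two arms returns only a noisy winner, never the utilities themselves. The `duel` function and the linear comparison probability are illustrative assumptions.

```python
import random

def duel(utilities, i, j, rng):
    """Noisy pairwise comparison: returns the index of the winning arm.

    Arm i beats arm j with probability 0.5 + 0.5 * (utilities[i] - utilities[j]),
    a simple linear link (assumed here for illustration; utilities lie in [0, 1]).
    The learner observes only the winner, not the underlying utilities.
    """
    p_i_wins = 0.5 + 0.5 * (utilities[i] - utilities[j])
    return i if rng.random() < p_i_wins else j

def empirical_win_rate(utilities, i, j, n_duels, seed=0):
    """Estimate P(arm i beats arm j) by repeated noisy duels."""
    rng = random.Random(seed)
    wins = sum(duel(utilities, i, j, rng) == i for _ in range(n_duels))
    return wins / n_duels

# Hypothetical hidden utilities for three arms (the learner never sees these).
utilities = [0.9, 0.4, 0.1]
rate = empirical_win_rate(utilities, 0, 1, 10_000)
```

With the utilities above, arm 0 should beat arm 1 in roughly 75% of duels; an algorithm such as Doubler or Sparring must identify good arms from exactly this kind of relative feedback.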