Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Barsky Daniel
Subject: CONQUER - Confusion Queried Bandit Learning
Department: Electrical Engineering
Supervisor: Professor Yacov Crammer
Full Thesis Text: English Version


Abstract

We present a new recommendation setting for picking two items to highlight to a user from a given set, based on contextual input. In this setting, given a set of K instances, the user is presented with two of them and must choose one, possibly stochastically, with a bias favoring the item of higher value. This setting is applicable in a suggestion system, where two items are displayed to the user in a highlighted fashion given their search query. It is also applicable in problems that are difficult to annotate fully and can benefit from light feedback, such as sentence dependency parsing.
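The feedback described above can be simulated with a simple preference model. The logistic link below is an illustrative assumption (the thesis does not specify this particular bias model); it only captures the stated property that the user's choice is stochastic and biased toward the higher-valued item.

```python
import math
import random

def preference_feedback(value_a, value_b, rng=random.Random(0)):
    """Simulate a user choosing between two displayed items.

    The choice is random but biased toward the item with the higher
    underlying value. A logistic link is used here purely for
    illustration; it is an assumption, not the thesis's model.
    """
    p_choose_a = 1.0 / (1.0 + math.exp(value_b - value_a))
    return 'a' if rng.random() < p_choose_a else 'b'

# With a large value gap, the higher-valued item wins almost always.
picks = [preference_feedback(3.0, -3.0) for _ in range(1000)]
```

Note that when the two values are equal, the model reduces to a fair coin flip, so the feedback carries information only insofar as the values differ.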

We propose a solution for this setting: a second-order algorithm that uses relative upper-confidence bounds, combining a greedy policy with a UCB policy so as to perform exploitation and exploration simultaneously.
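A minimal sketch of such a combined selection rule: pick one item greedily by its estimated value and the other by an upper-confidence bound (estimate plus confidence width). This is a simplification for illustration only; the thesis's actual rule is second-order (confidence-weighted) and uses relative bounds between items.

```python
import numpy as np

def pick_pair(estimates, widths):
    """Pick two of K items: one by exploitation, one by exploration.

    The first item is chosen greedily by estimated value; the second
    maximizes an upper-confidence bound (estimate + width) among the
    remaining items. A hypothetical sketch, not the thesis's algorithm.
    """
    estimates = np.asarray(estimates, dtype=float)
    widths = np.asarray(widths, dtype=float)
    greedy = int(np.argmax(estimates))
    ucb_scores = estimates + widths
    ucb_scores[greedy] = -np.inf  # force a distinct second item
    ucb = int(np.argmax(ucb_scores))
    return greedy, ucb

# Item 2 has the best estimate; item 0 is uncertain enough to explore.
pair = pick_pair([0.1, 0.4, 0.9], [1.5, 0.1, 0.1])
```

Pairing the two policies this way means the displayed pair always contains the current best guess, so the user-facing quality never drops for the sake of exploration.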

We aim to minimize a regret that compares the best item from the set of K items at time t to the better of the two predictions made by the algorithm at that time. We analyze this algorithm in an adversarial setting, under only mild assumptions on the data, and prove a regret bound of

O(Q*_T + (T Q*_T log T)^{1/2} + T^{1/2} log T)

that holds with high probability, where T is the number of rounds and Q*_T is the cumulative approximation error of item values under a linear model. We evaluate our algorithm in three different settings: reviews of items from 33 Amazon domains, 9 text categorization tasks, and 4 vowel recognition tasks. Experimental results show the advantage of our method over several other algorithms designed for this and related settings, both in terms of online loss and generalization error on an unseen test set.
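The regret being bounded can be written as follows. The notation here is ours, introduced for clarity: v_{t,i} denotes the value of item i at round t, and a_t, b_t are the two items the algorithm displays.

```latex
R_T \;=\; \sum_{t=1}^{T} \Big( \max_{i \in [K]} v_{t,i}
      \;-\; \max\{\, v_{t,a_t},\; v_{t,b_t} \,\} \Big)
```

That is, at each round the algorithm is charged only for the gap between the best available item and the better of its two picks, matching the "better prediction" comparison described above.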

Finally, we suggest a similar solution for training dependency parsers with feedback only on the source of a single word in the sentence. For this solution we prove a regret bound of O(S T^{1/2} log T), where S is an upper bound on the length of a sentence.