M.Sc Thesis

M.Sc StudentChapman Hillel
SubjectGlobal Confidence Bound Algorithms for the Exploration-
Exploitation Tradeoff in Reinforcement Learning
DepartmentDepartment of Electrical and Computer Engineering
Supervisor PROF. Nahum Shimkin
Full Thesis textFull thesis text - English Version


Reinforcement Learning (RL) is a promising computational approach for constructing autonomous systems that improve with experience. Its applications range from robotics, through industrial manufacturing and scheduling, to combinatorial search problems such as board games.

While theoretically sound, standard RL algorithms fail in many tasks of practical interest. Such tasks are usually characterized by an enormous state space and/or lack of immediate reinforcement. We believe a major element of successes in an RL algorithm is the exploration algorithm.

In this thesis we propose efficient exploration schemes for on-line RL, focusing on the standard Markov Decision Process model with finite state and action spaces, and the discounted cost criterion. We present the Global Confidence Bound (GCB) algorithm that tackles the explore-exploit tradeoff directly. Essentially, the algorithm (1) modifies the standard estimates of the model parameters (rewards and transition probabilities) by adding appropriate (upper) confidence bounds; (2) computes the value function associated with the modified parameters, which is shown to majorize the true value function with high probability; (3) chooses the next action greedily based on the computed value function. Our algorithm is conceptually simple and intuitive, and yet ensures fast convergence to the optimal policy. Specifically, for the class of models with deterministic transitions, we establish a polynomial bound on the number of suboptimal actions taken by our algorithm.

The GCB algorithm requires the computation of the value function at each stage, leading to potentially high computational load. We present a novel variant of our basic algorithm, the GCB-SB (Single Backup) algorithm that eliminates the computational burden by performing a single Bellman update at each time step. We provide a polynomial bound for the number of suboptimal actions taken by this algorithm as well. We finally demonstrate the performance of the proposed algorithms using several simulation experiments.