M.Sc Student M.Sc Thesis Dyagilev Kirill Efficient Reinforcement Learning in Parameterized Models Department of Electrical and Computer Engineering PROF. Nahum Shimkin

Abstract

Reinforcement Learning is a promising computational approach for implementing autonomous systems that improve their performance with experience. Its applications range from robotics, through industrial manufacturing and scheduling, to combinatorial search problems such as board games.

Learning rates of Reinforcement Learning algorithms critically depend on the efficient exploration of the state and action spaces. Recently, several algorithms that employ efficient exploration techniques were introduced and proven to possess learning rate metrics that are polynomial in the model size. Still, in absence of prior knowledge regarding the model structure or parameters, learning time grows at least linearly in the size of the state and action spaces. While these algorithms are efficient for small problems, they become infeasible in cases where the size of the model is large. This limitation may be overcome if prior information is known, as is often the case.

In this thesis we consider reinforcement learning in the parameterized setup. We first assume that the model is known to belong to a finite set of Markov Decision Processes under the discounted return criterion. We propose an on-line algorithm for learning in such parameterized models, the Parameter Elimination (PEL) algorithm, and analyze its performance in terms of the total mistake bound criterion. The algorithm relies on Wald's sequential probability ratio test to eliminate unlikely parameters, and uses an optimistic policy for effective exploration.

We establish that, with high probability, the total mistake bound for the algorithm is linear (up to a logarithmic term) in the size of the parameter space, independently of the cardinality of the state and action spaces. We further demonstrate that better dependence on size of the parameter space is possible, depending on the specific information structure of the problem.

We further extend the PEL algorithm to the continuous parameter case through discretization, and analyze its performance. This algorithm is also elimination-based, however, it relies on a modified version of the sequential probability ratio test that takes into account the mismatch between the true parameter and the closest point of the discretization grid. The total mistake bound of the continuous parameter PEL algorithm is shown to be linear in the number of points on the grid needed for discretization. We conclude this thesis by considering an extension of our scheme to an average reward criterion and discuss the ideas involved.