Technion - Israel Institute of Technology
The Graduate School
M.Sc. Thesis
M.Sc. Student: Ishai Menache
Subject: Hierarchical Structures in Reinforcement Learning
Department: Electrical Engineering
Supervisor: Professor Nahum Shimkin


Abstract

Reinforcement Learning (RL) is a promising computational tool for constructing autonomous systems that improve with experience. Its applications range from robotics and industrial manufacturing to combinatorial search problems such as computer game playing.


While theoretically well established, standard RL algorithms fail in many tasks of practical interest. Such tasks are usually characterized by an enormous state space and/or the lack of an immediate reinforcement signal. In this dissertation we present two novel approaches for addressing these issues.


We begin with an approach for the automatic detection of sub-goals in a dynamic environment, which is used to accelerate learning. Our Q-Cut algorithm uses an efficient Max-Flow/Min-Cut procedure to identify bottlenecks in the observed state trajectories. Policies for reaching these bottlenecks are learned separately and added to the model in the form of options. The basic algorithm is extended to the Segmented Q-Cut algorithm, which uses the identified bottlenecks to partition the state space, a step necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements with both algorithms, particularly in the initial phase of learning.
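To illustrate the bottleneck-identification step only (not the thesis's exact implementation), the following minimal sketch builds a directed graph from observed transitions, uses transition counts as edge capacities, and computes a Max-Flow/Min-Cut between two assumed source and target states; the cut edges are candidate bottlenecks. The use of networkx, the capacity choice, and all names here are illustrative assumptions.

```python
# Sketch of min-cut bottleneck discovery on a state-transition graph.
# Assumptions (not from the thesis): networkx for max-flow/min-cut,
# transition counts as edge capacities, fixed source/target states.
import networkx as nx

def find_bottlenecks(transitions, source, target):
    """transitions: iterable of (s, s_next) pairs observed along trajectories."""
    G = nx.DiGraph()
    for s, s_next in transitions:
        if G.has_edge(s, s_next):
            G[s][s_next]["capacity"] += 1
        else:
            G.add_edge(s, s_next, capacity=1)
    # The min-cut separates the source side from the target side;
    # the edges crossing the cut mark candidate bottleneck transitions.
    cut_value, (source_side, target_side) = nx.minimum_cut(G, source, target)
    cut_edges = [(u, v) for u in source_side for v in G[u] if v in target_side]
    return cut_value, cut_edges

# Example: two "rooms" connected through a single doorway transition d -> c.
trajectory = [("a", "b"), ("b", "d"), ("d", "c"), ("a", "b"), ("b", "d")]
print(find_bottlenecks(trajectory, "a", "c"))  # cut of value 1 at ("d", "c")
```

In the full algorithm, a policy for reaching each discovered bottleneck would then be learned and packaged as an option, but that step is omitted here.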


We further propose a new method for value function approximation. Using a radial basis function (RBF) network architecture, both the linear weights of the network and the basis-function parameters are adapted online. The temporal-difference algorithm for linear approximation architectures is used not only to estimate the linear weights of the network, but also to pose an optimization problem for adapting the basis functions. A gradient-based method and a global optimization algorithm (the Cross-Entropy method) are applied to this optimization problem as alternatives. Initial simulations show the potential of combining the two methods for relatively accurate policy evaluation.
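As a rough sketch of this idea under simplifying assumptions (a one-dimensional state, TD(0) policy evaluation, and a plain stochastic-gradient update on the RBF centers and widths in place of the Cross-Entropy variant; all hyperparameters and names are illustrative):

```python
# Sketch: TD(0) policy evaluation with an RBF network whose linear weights
# AND basis-function parameters (centers, widths) are adapted online.
# Illustrative only; the 1-D state and all step sizes are assumptions.
import numpy as np

K = 10                                # number of basis functions
centers = np.linspace(0.0, 1.0, K)   # RBF centers c_k
widths = np.full(K, 0.1)             # RBF widths sigma_k
w = np.zeros(K)                      # linear weights
alpha_w, alpha_b, gamma = 0.1, 0.01, 0.95

def phi(s):
    """Gaussian RBF feature vector for scalar state s."""
    return np.exp(-((s - centers) ** 2) / (2.0 * widths ** 2))

def td_step(s, r, s_next):
    """One online TD(0) update of both the weights and the basis parameters."""
    global w, centers, widths
    f, f_next = phi(s), phi(s_next)
    delta = r + gamma * w @ f_next - w @ f   # TD error
    w = w + alpha_w * delta * f              # linear-weight update
    # Gradients of V(s) = w . phi(s) w.r.t. centers and widths; the basis
    # parameters are driven by the same TD-error signal.
    dphi_dc = f * (s - centers) / widths ** 2
    dphi_dsig = f * (s - centers) ** 2 / widths ** 3
    centers = centers + alpha_b * delta * w * dphi_dc
    widths = widths + alpha_b * delta * w * dphi_dsig
    return delta
```

A Cross-Entropy alternative would instead sample populations of (centers, widths) candidates and re-fit a sampling distribution to the best-scoring ones, trading the local gradient step for a global search over the basis parameters.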