| M.Sc Thesis | |
|---|---|
| M.Sc Student | Menache Ishai |
| Subject | Hierarchical Structures in Reinforcement Learning |
| Department | Department of Electrical and Computer Engineering |
| Supervisor | Prof. Nahum Shimkin |

Reinforcement Learning (RL) is a promising computational tool for constructing autonomous systems that improve with experience. Its applications range from robotics and industrial manufacturing to combinatorial search problems such as computer game playing.

While theoretically well established, standard RL algorithms fail in many tasks of practical interest. Such tasks are usually characterized by an enormous state space and/or the lack of an immediate reinforcement. In this dissertation we present two novel approaches for addressing these issues.

We begin with an approach for the automatic detection of sub-goals in a dynamic environment, which is used to accelerate learning. Our *Q-Cut* algorithm uses an efficient Max-Flow/Min-Cut procedure to identify bottlenecks in the state trajectories. Policies for reaching bottlenecks are learned separately and added to the model in the form of options. The basic algorithm is extended to the *Segmented Q-Cut* algorithm, which uses the identified bottlenecks for state-space partitioning, necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements with both algorithms, particularly in the initial learning phase.
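The thesis's own implementation is not reproduced here, but the core graph-theoretic step can be sketched as follows: a minimal Max-Flow/Min-Cut computation (Edmonds-Karp) over a toy transition graph, where two "rooms" are joined by a single doorway edge. All node names, capacities, and the `min_cut` helper are illustrative, not the Q-Cut code itself; in Q-Cut the capacities would be derived from observed state-transition frequencies.

```python
from collections import defaultdict, deque

def min_cut(edges, s, t):
    """Edmonds-Karp max-flow; returns the cut edges separating s's side from t's side."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edges:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # residual direction
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        # collect the path edges, augment by the bottleneck capacity
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        f = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= f
            cap[(v, u)] += f
    # s-side of the cut = nodes still reachable in the residual graph
    side, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in side and cap[(u, v)] > 0:
                side.add(v)
                q.append(v)
    return [(u, v) for u, v, c in edges if u in side and v not in side]

# Two "rooms" (a-b-c and d-e-f) joined by a single low-capacity doorway b->d:
edges = [("a", "b", 3), ("b", "c", 3), ("c", "b", 3), ("b", "d", 1),
         ("d", "e", 3), ("e", "f", 3)]
print(min_cut(edges, "a", "f"))  # → [('b', 'd')]: the doorway is the bottleneck
```

The returned cut edges are the bottleneck candidates; in Q-Cut, reaching the source side of such an edge becomes a sub-goal for which an option policy is learned.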

We further propose a new method for value function approximation.

Using an RBF approximating network architecture, both the linear weights of the network and the basis-function parameters are adapted on-line. The temporal-difference algorithm for linear approximation architectures is used not only to estimate the linear weights of the network, but also to pose an optimization problem for adapting the basis functions. A gradient-based method and a global optimization algorithm (Cross-Entropy) are alternatively applied to this optimization problem. Initial simulations show the potential of combining the two methods for relatively accurate policy evaluation.
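The thesis's exact network and evaluation task are not reproduced here; the following is a minimal sketch of the idea on a toy 1-D chain, assuming Gaussian RBFs, a semi-gradient TD(0) update for the linear weights, and a gradient step on the basis-function centers driven by the same TD error. All constants (step sizes, number of bases, the chain dynamics) are illustrative.

```python
import numpy as np

def rbf_features(s, centers, widths):
    # Gaussian basis functions: phi_i(s) = exp(-(s - c_i)^2 / (2 w_i^2))
    return np.exp(-((s - centers) ** 2) / (2 * widths ** 2))

# Toy 1-D chain: the state drifts from 0 to 1; reward 1 on reaching the right end.
centers = np.linspace(0.0, 1.0, 5)    # basis-function centers (adapted on-line below)
widths = np.full(5, 0.15)             # basis-function widths (held fixed here)
theta = np.zeros(5)                   # linear weights, V(s) = theta . phi(s)
alpha, beta, gamma = 0.1, 0.01, 0.95  # weight / center step sizes, discount factor

for episode in range(200):
    s = 0.0
    while s < 1.0:
        s2 = min(s + 0.1, 1.0)
        r = 1.0 if s2 >= 1.0 else 0.0
        phi = rbf_features(s, centers, widths)
        phi2 = rbf_features(s2, centers, widths)
        delta = r + gamma * theta @ phi2 - theta @ phi  # TD error
        theta += alpha * delta * phi                    # TD(0) weight update
        # Gradient step on the centers, using dphi_i/dc_i = phi_i (s - c_i) / w_i^2:
        centers += beta * delta * theta * phi * (s - centers) / widths ** 2
        s = s2
```

A Cross-Entropy search over the same basis parameters would instead sample candidate center/width vectors from a parametric distribution, score each by the resulting TD-based objective, and refit the distribution to the elite candidates; the two approaches can be interleaved, as the abstract suggests.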