Technion - Israel Institute of Technology, Graduate School
M.Sc. Thesis
M.Sc. Student: Yehuda Levy
Subject: A Unified Framework for Hierarchical Reinforcement Learning Using Options
Department: Department of Electrical Engineering
Supervisor: Full Professor Nahum Shimkin


Abstract

Reinforcement Learning (RL) is a general framework for learning optimal sequential decision-making policies through interaction with a possibly unknown and stochastic environment. Having succeeded in many challenging applications in fields such as robotics, board games, and industrial manufacturing, reinforcement learning has attracted a great deal of research interest and has become one of the central topics in machine learning and artificial intelligence.

In complex planning problems it is often useful to utilize macro-actions, where each macro-action is a restricted plan or policy, and to compose these macro-actions into an overall plan or policy. Decisions are thus made in a hierarchical manner: the high-level decision hierarchy chooses a macro-action, which in turn chooses a low-level (primitive) action. However, since the overall hierarchical plan is usually non-stationary, existing reinforcement learning methods cannot be applied directly. Moreover, in such hierarchical decision algorithms, the stopping condition of a macro-action, i.e., the rule that defines when a macro-action should end, is either kept fixed or changed greedily.
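
For concreteness, the following is a minimal sketch of a macro-action in the standard options formalism, with an internal policy and a stochastic stopping condition. The Option class, the run_option helper, and the env.step interface are illustrative assumptions, not the thesis's notation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    # Internal policy: maps a state to a primitive action.
    policy: Callable
    # Stopping condition: maps a state to a termination probability.
    termination: Callable
    # Initiation set: states in which the option may be invoked.
    initiation: Callable

def run_option(env, state, option, rng, gamma=0.99):
    # Execute the option's internal policy until its stopping
    # condition fires, accumulating the discounted reward.
    total_reward, discount = 0.0, 1.0
    while True:
        action = option.policy(state)
        state, reward, done = env.step(action)  # assumed env interface
        total_reward += discount * reward
        discount *= gamma
        if done or rng.random() < option.termination(state):
            return state, total_reward, done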

In this dissertation we formulate a unifying framework that enables an agent making decisions in a hierarchical manner to simultaneously optimize its controls at all levels of decision making: the stopping conditions, the internal policies of the macro-actions, and the high-level choices. This framework translates the overall hierarchical plan into an equivalent stationary policy, to which standard RL algorithms can be applied.
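
A rough sketch of the reduction idea, under the assumption that the augmented state pairs the environment state with the identity of the currently active macro-action; the function names here are hypothetical, and the thesis's exact construction may differ.

def flat_policy_step(state, active, options, high_level_policy, rng):
    # One step of the equivalent stationary policy over the augmented
    # state (environment state, active macro-action). If the stopping
    # condition of the active macro-action fires, the high level
    # selects a new one; otherwise the current one retains control.
    if active is None or rng.random() < options[active].termination(state):
        active = high_level_policy(state)
    action = options[active].policy(state)  # primitive action
    return action, active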

We then identify structural properties that enable a computationally efficient implementation of the natural policy gradient algorithm and the LSTD algorithm for learning the equivalent stationary policy.
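
As a point of reference for the second of these algorithms, here is a generic LSTD(0) sketch for evaluating a fixed policy from transition samples. The thesis's efficient implementation exploits structure of the equivalent stationary policy that is not reflected in this generic version; the feature map and the regularization term are assumptions.

import numpy as np

def lstd(features, transitions, gamma=0.99, reg=1e-6):
    # Generic LSTD(0): solve A w = b, where
    #   A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T
    #   b = sum_t phi(s_t) * r_t
    d = features(transitions[0][0]).shape[0]
    A = reg * np.eye(d)  # small ridge term to ensure invertibility
    b = np.zeros(d)
    for s, r, s_next in transitions:
        phi, phi_next = features(s), features(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)  # value-function weights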