Ph.D. Thesis


Ph.D. Student: Efroni Yonathan
Subject: Lookahead Policies: Dynamic Programming, Online Planning and Reinforcement Learning
Department: Electrical and Computer Engineering
Supervisor: Prof. Shie Mannor
Full Thesis Text: English Version


Abstract

Lookahead policies are a widely used tool in sequential decision making and are known to give rise to better-performing algorithms. In this thesis, we examine different aspects of this tool in three frameworks: planning, online planning, and reinforcement learning. For each of these frameworks, we provide different answers to why, when, and how lookahead policies should be used in sequential decision making. The starting point of this thesis is a re-examination of lookahead policies in planning algorithms, i.e., in the framework of dynamic programming and approximate dynamic programming. We generalize existing performance guarantees that hold for 1-step planning algorithms (e.g., value iteration and policy iteration) to lookahead-based planning algorithms and, more importantly, highlight several important differences between 1-step planning algorithms and their lookahead-based counterparts and demonstrate their empirical performance. We then establish guarantees for lookahead algorithms in online planning and prove that increasing the lookahead horizon reduces both the sample complexity and the value approximation errors. Lastly, we study the performance of lookahead policies in reinforcement learning, where an agent needs to trade off exploration and exploitation since the true dynamics is unknown. We consider the standard Markov Decision Process framework and arrive at the following result: in general, lookahead policies do not improve the performance of an RL agent that interacts with an MDP. Specifically, we prove that an algorithm that plans only 1 step ahead achieves the optimal minimax performance. This surprising conclusion suggests that when the model is unknown, lookahead policies might not substantially improve the performance of an RL agent; avoiding planning several steps ahead then results in an improved computational cost.
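
To make the notion of a lookahead policy concrete, the following is a minimal Python sketch, not taken from the thesis, of an h-step lookahead greedy policy and of value iteration with an h-step improvement step on a tabular MDP; setting h = 1 recovers the standard 1-step greedy improvement. The array names P and r, the parameters gamma and h, and the helper functions are illustrative assumptions, not the thesis's notation or algorithms.

import numpy as np

# Illustrative sketch (not from the thesis): h-step lookahead on a tabular MDP.
# Assumptions: P has shape (A, S, S) with P[a, s, s'] = Pr(s' | s, a),
# r has shape (A, S) with r[a, s] the reward, gamma is the discount factor.

def bellman_backup(V, P, r, gamma):
    # One application of the Bellman optimality operator:
    # (T V)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]
    Q = r + gamma * np.einsum("ast,t->as", P, V)   # shape (A, S)
    return Q.max(axis=0), Q.argmax(axis=0)

def h_step_lookahead_policy(V, P, r, gamma, h):
    # Greedy policy with respect to T^{h-1} V: apply the Bellman operator
    # h-1 times, then act greedily with respect to the resulting estimate.
    for _ in range(h - 1):
        V, _ = bellman_backup(V, P, r, gamma)
    _, policy = bellman_backup(V, P, r, gamma)
    return policy

def lookahead_value_iteration(P, r, gamma, h, iters=200):
    # Value iteration whose improvement step uses h-step lookahead; each
    # outer iteration applies the Bellman operator h times (h = 1 is plain VI).
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        for _ in range(h):
            V, policy = bellman_backup(V, P, r, gamma)
    return V, policy

# Usage on a small random MDP (illustrative only).
rng = np.random.default_rng(0)
A, S = 3, 5
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)      # normalize rows into transition distributions
r = rng.random((A, S))
V, pi = lookahead_value_iteration(P, r, gamma=0.9, h=3)
print("3-step lookahead greedy policy:", h_step_lookahead_policy(V, P, r, gamma=0.9, h=3))

Since T is a gamma-contraction, applying it h times per improvement step contracts by gamma^h, which is one intuition (under these simplifying assumptions) for why longer lookahead can yield better-performing planning iterates at a higher per-step computational cost.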