|Ph.D. Student||Hallak Assaf|
|Subject||Off-Policy Evaluation in a Reinforcement Learning Setting|
|Department||Department of Electrical Engineering||Supervisor||Professor Shie Mannor|
|Full Thesis text|
The framework of Reinforcement Learning (RL) considers sequential decision making by an agent in an unknown stochastic environment. In the most basic setup, the agent employs some policy over time to maximize its cumulative reward. RL techniques can then be used to propose new, higher-reward policies based on observed data. Ideally, these new policies can be deployed and evaluated, forming an evaluation-and-improvement cycle. Unfortunately, many domains have limited sampling capability, which prevents the system from applying sub-optimal policies over time. Consequently, assessing the performance of a complex strategy without trying it, known as off-policy evaluation (OPE), is essential for providing strong evidence that a new policy is better than the one in use.
In this work we explore and analyze novel ways to perform OPE in an RL setting. We begin with a model-based approach especially suited to large-scale problems called factored Markov decision processes (FMDPs). The main difficulty in learning FMDPs is inferring their structure efficiently. Our method, G-SCOPE, greedily learns the structure in a way that scales with the amount of available data, and we provide finite-sample guarantees for the OPE error. Then, we extend an existing online model-free approach called ETD to allow for a bias-variance trade-off and analyze its convergence; our extended ETD algorithm bridges the gap between two previous cornerstone algorithms. Finally, we point to an existing flaw in state-of-the-art online OPE algorithms caused by a discrepancy between the behavior and target stationary distributions. We propose a simple and efficient remedy via a new algorithm called COP-TD, analyze its properties, and show that it converges under reasonable conditions.
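To illustrate the stationary-distribution discrepancy mentioned above, the sketch below shows a tabular COP-TD-style update on a toy two-state MDP. Everything here is an illustrative assumption rather than the thesis's exact formulation: the MDP, the policies, the step size, and the normalization are all chosen for the example. The idea is that the ratio c(s) = d_pi(s) / d_mu(s) between the target and behavior stationary distributions can be estimated online with a TD-like rule driven by the per-step importance ratio.

```python
import numpy as np

# Hedged sketch of a tabular COP-TD-style update (toy example, not the
# thesis's exact algorithm). We estimate c(s) = d_pi(s) / d_mu(s) with
# the stochastic update
#     c[s'] <- c[s'] + alpha * (rho * c[s] - c[s']),
# where rho = pi(a|s) / mu(a|s) is the per-step importance ratio.

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
# Toy MDP: the next-state distribution depends only on the action;
# action a leads to state a with probability 0.9.
P = np.array([[0.9, 0.1],    # next-state probs under action 0
              [0.1, 0.9]])   # next-state probs under action 1
mu = np.array([0.5, 0.5])    # behavior policy: uniform over actions
pi = np.array([0.1, 0.9])    # target policy: strongly prefers action 1

c = np.ones(n_states)        # ratio estimates, initialized to 1
alpha = 0.01
s = 0
for _ in range(200_000):
    a = rng.choice(n_actions, p=mu)
    s_next = rng.choice(n_states, p=P[a])
    rho = pi[a] / mu[a]      # importance ratio for this transition
    c[s_next] += alpha * (rho * c[s] - c[s_next])
    s = s_next

# c is identified only up to scale; since d_mu is uniform here, dividing
# by the mean enforces the normalization E_mu[c] = 1.
c /= c.mean()
print(c)  # the target policy concentrates on state 1, so c[1] > c[0]
```

In this toy MDP the true ratios can be computed in closed form (d_mu = (0.5, 0.5), d_pi = (0.18, 0.82), so c = (0.36, 1.64)), which makes it easy to check that the estimate moves in the right direction.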