Ph.D Thesis

Ph.D Student: Baram Nir
Subject: From Imitation to Maximum Entropy Reinforcement Learning
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Shie Mannor
Full Thesis Text: English version


Learning by imitation has become increasingly popular in fields where it is easier for an expert to demonstrate a behavior than to program it. Meanwhile, the last few years have seen a surge in the popularity of imitation algorithms that use generative adversarial networks to learn reward functions. In our first work, we attempt to mitigate high-variance reward estimation, a fundamental problem in policy-gradient-based adversarial imitation. By introducing a forward model, we demonstrate how to obtain a lower-variance, albeit biased, estimate of the policy gradient.


The setting we describe, like any other adversarial imitation learning method, depends on the ability to model the environment. When that is not possible, one reluctantly falls back on conventional behavioral cloning, which learns a policy by solving a supervised learning problem over state-action pairs. Unfortunately, behavioral cloning remains the only option even when significant parts of the environment can be modeled. Our second work explores whether partial information about an environment's forward dynamics can help develop a behavioral cloning algorithm with features similar to those of modern simulation-based imitation algorithms. We introduce the Semi-Responsive Induced Markov Decision Process (SRMDP) model as a formulation for solving imitation learning problems with reinforcement learning (RL) when only partial knowledge of the transition kernel is available. We evaluate the method in a series of multi-player games in which the policy of opposing experts cannot be modeled or simulated.
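As a minimal illustration of behavioral cloning as supervised learning over state-action pairs, the sketch below fits a linear policy by least squares. It is not the method developed in the thesis: the linear policy class, the synthetic expert (`true_W`), and the function names are all assumptions made only for this example.

```python
import numpy as np


def with_bias(states):
    """Append a constant column so the linear policy has a bias term."""
    return np.hstack([states, np.ones((states.shape[0], 1))])


def behavioral_cloning_fit(states, actions):
    """Fit a linear policy a = [s, 1] @ W by least squares on expert pairs."""
    W, *_ = np.linalg.lstsq(with_bias(states), actions, rcond=None)
    return W


def policy(W, states):
    """Evaluate the cloned linear policy on a batch of states."""
    return with_bias(states) @ W


rng = np.random.default_rng(0)

# Hypothetical expert: a fixed linear controller with small action noise.
true_W = np.array([[1.5], [-0.7], [0.2]])  # two state dimensions plus bias
states = rng.normal(size=(500, 2))
expert_actions = with_bias(states) @ true_W + 0.01 * rng.normal(size=(500, 1))

# Supervised learning step: regress expert actions on visited states.
W = behavioral_cloning_fit(states, expert_actions)
mse = float(np.mean((policy(W, states) - expert_actions) ** 2))
print("recovered weights:", W.ravel(), "training MSE:", mse)
```

Note that the fit only uses states visited by the expert and never queries the environment, which is exactly why behavioral cloning is available when the environment cannot be modeled.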

This thesis also presents work in the field of Maximum-Entropy (MaxEnt) RL. Among the state-of-the-art approaches in off-policy RL, MaxEnt methods achieve superior performance by providing a structured approach to exploration: they add the entropy over actions to the discounted sum of rewards, which greatly aids exploration and stability. To date, MaxEnt algorithms apply only to elementary policy distribution classes. For the MaxEnt framework to remain competitive in complex control tasks, it should be extended to handle a wide range of policy classes capable of accommodating complicated behaviors, such as mixture policies. However, using mixture policies in MaxEnt algorithms is not straightforward. The entropy of a mixture model is not equal to the sum of the entropies of its components, nor does it have a closed-form expression in most cases. In this work, we derive a simple, low-variance mixture-entropy estimator. Equipped with this estimator, we derive a variant of Soft Actor-Critic (SAC) for the mixture-policy case and evaluate it on a series of continuous control tasks.
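To make the mixture-entropy difficulty concrete, the following sketch compares a naive Monte Carlo estimate of the entropy of a one-dimensional Gaussian mixture with the weighted sum of its component entropies. This is a generic sampling estimator chosen for illustration, not the low-variance estimator derived in the thesis; the mixture parameters and function names are assumptions.

```python
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2


def mixture_logpdf(x, weights, means, stds):
    """Log-density of the mixture, via a numerically stable log-sum-exp."""
    comp = np.log(weights)[None, :] + gaussian_logpdf(
        x[:, None], means[None, :], stds[None, :]
    )
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()


def mc_mixture_entropy(weights, means, stds, n, rng):
    """Naive Monte Carlo entropy estimate: -mean(log p(x)) with x ~ p."""
    ks = rng.choice(len(weights), size=n, p=weights)  # sample component index
    xs = rng.normal(means[ks], stds[ks])              # sample from that component
    return float(-mixture_logpdf(xs, weights, means, stds).mean())


def gaussian_entropy(std):
    """Closed-form entropy of a univariate Gaussian."""
    return 0.5 * np.log(2 * np.pi * np.e * std**2)


# Hypothetical two-component mixture with well-separated modes.
weights = np.array([0.5, 0.5])
means = np.array([-5.0, 5.0])
stds = np.array([1.0, 1.0])

rng = np.random.default_rng(0)
h_mix = mc_mixture_entropy(weights, means, stds, 100_000, rng)
h_weighted = float(np.sum(weights * gaussian_entropy(stds)))
print("Monte Carlo mixture entropy:", h_mix)
print("weighted sum of component entropies:", h_weighted)
```

For well-separated components the gap between the two quantities approaches the entropy of the mixing weights (here log 2), so treating the mixture entropy as a combination of component entropies would systematically misstate the exploration bonus.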

In our final work, we propose a different MaxEnt setup that optimizes the transition entropy, that is, the entropy of next states. This alternative regularization criterion can be described by two terms: model-dependent transition entropy and action redundancy. In particular, we study the latter in both deterministic and stochastic settings and develop tractable approximation methods in a near model-free setup. We construct algorithms that minimize action redundancy and demonstrate their effectiveness in a synthetic environment with multiple redundant actions, as well as on contemporary benchmarks. Our results suggest that action redundancy is a fundamental problem in RL.