M.Sc Thesis

M.Sc Student: Dorfman Ron
Subject: Offline Meta Reinforcement Learning of Efficient Exploration
Department: Department of Electrical and Computer Engineering
Supervisor: Dr. Aviv Tamar


Reinforcement Learning (RL) is an area of machine learning concerned with designing agents that learn to make sequential decisions in an environment in order to maximize some notion of reward. A key challenge in RL is the exploration-exploitation trade-off, which arises whenever an agent must discover high-reward strategies. At each time step, the agent faces a question: should it take the best action given its current knowledge (i.e., exploit), or should it take an action that gathers more information about the environment (i.e., explore)?
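To make the trade-off above concrete, the following is a minimal sketch of epsilon-greedy action selection on a Bernoulli multi-armed bandit. It is a textbook heuristic used here only for illustration and is not the method developed in this thesis; the function name `epsilon_greedy_bandit` and its parameters (`arm_probs`, `steps`, `epsilon`, `seed`) are hypothetical.

```python
import random


def epsilon_greedy_bandit(arm_probs, steps=2000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit (illustrative, hypothetical helper).

    arm_probs[i] is the (unknown to the agent) success probability of arm i.
    Returns (values, counts): the running mean reward and pull count per arm.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward estimate per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random to gather information.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate
            # (ties resolve to the lowest index).
            arm = max(range(n_arms), key=lambda a: values[a])

        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        values[arm] += (reward - values[arm]) / counts[arm]

    return values, counts
```

With `epsilon=0` the agent only exploits: on `arm_probs=[0.0, 1.0]` it pulls the first arm forever and never finds the better one, which is the failure mode that exploration is meant to avoid.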

Meta-RL tackles the problem of learning quickly in a new environment by considering a distribution over possible environments and assuming access to a set of training environments sampled from this distribution. Intuitively, a meta-RL agent can learn regularities in the environments, which allow quick learning in any environment that shares a similar structure. Most meta-RL studies have focused on the online setting, where, during training, the meta-RL agent is continually updated using data collected from running it in the training environments.

In this thesis, we consider the following problem, which we term Offline Meta Reinforcement Learning (OMRL): given the complete training histories of N conventional RL agents, trained on N different tasks, design a learning agent that can quickly maximize reward in a new, unseen task from the same task distribution. In particular, while each conventional RL agent explored and exploited its own task, the OMRL agent must identify regularities in the data that lead to effective exploration and exploitation in the unseen task. To solve OMRL, we take a Bayesian RL (BRL) view and seek to learn a Bayes-optimal policy from the offline data. We extend the recently proposed VariBAD BRL algorithm to the off-policy setting, and demonstrate learning of Bayes-optimal exploration strategies from offline data using deep neural networks. Furthermore, when applied to the online meta-RL setting (where the agent simultaneously collects data and improves its meta-RL policy), our method is significantly more sample efficient than conventional VariBAD.
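In the Bayesian RL view mentioned above, the agent maintains a belief over which task it is facing and updates that belief as it observes rewards. The sketch below shows an exact Bayes update over a small, discrete set of candidate Bernoulli-bandit tasks. It is a toy illustration under stated assumptions, not the thesis's algorithm: VariBAD approximates the belief with a learned neural network rather than enumerating tasks, and the names `update_belief` and `task_probs` are hypothetical.

```python
def update_belief(belief, task_probs, arm, reward):
    """Exact Bayes update of a belief over candidate tasks (illustrative only).

    belief[k]        : prior probability that the true task is task k
    task_probs[k][a] : probability that arm a pays reward 1 under task k
    arm, reward      : the arm just pulled and the observed reward (0 or 1)
    Returns the posterior belief as a new list.
    """
    # Likelihood of the observed reward under each candidate task.
    likelihoods = [
        probs[arm] if reward == 1 else 1.0 - probs[arm]
        for probs in task_probs
    ]
    unnormalized = [b * lik for b, lik in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        # Observation is impossible under every candidate; keep the prior.
        return list(belief)
    return [u / total for u in unnormalized]


# Two candidate tasks: in task 0 arm 0 is good, in task 1 arm 1 is good.
tasks = [[0.9, 0.1], [0.1, 0.9]]
posterior = update_belief([0.5, 0.5], tasks, arm=0, reward=1)
```

Here a single rewarded pull of arm 0 shifts the belief toward task 0. A Bayes-optimal policy chooses actions as a function of such a belief, so it explores only as much as is needed to reduce uncertainty that matters for future reward.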