M.Sc Thesis


M.Sc Student: Shimkin Izchak
Subject: Exploration in Model-Based Reinforcement Learning with Stochastic Dynamics
Department: Electrical and Computer Engineering
Supervisor: Prof. Ron Meir
Full Thesis Text: English Version


Abstract

Reinforcement Learning (RL) is an area of machine learning that deals with the design of agents that learn to make sequential decisions in an environment in order to maximize some notion of reward. Until recently, applying RL algorithms to real-world systems was extremely difficult; however, advances in the field, coupled with the increase in computing power, have enabled RL developers to deal with real-world problems and more complex systems.


As RL researchers attempt to handle complex systems, they must address new challenges, among them the need for better and more efficient exploration. RL algorithms require many training samples, and the number needed can grow exponentially as the system becomes more complex.

Another challenge is dealing with natural, unconstrained, and often stochastic environments. Most RL algorithms are evaluated in simulated environments characterized by deterministic state transitions. For RL to perform complex tasks on which lives may depend, more research is needed into algorithms that cope with more complex and inhomogeneous noise.
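To make the notion of inhomogeneous (heteroscedastic) noise concrete, the sketch below shows a hypothetical dynamics model in PyTorch that predicts a state-dependent Gaussian over the next state, so the noise level may vary across the state-action space. The class and function names are illustrative and are not taken from the thesis implementation.

```python
# Minimal sketch (not the thesis code): a dynamics model that outputs a
# state-dependent Gaussian over the next state, allowing the noise level
# to vary across the state-action space (heteroscedastic noise).
import torch
import torch.nn as nn

class HeteroscedasticDynamics(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)     # predicted next-state mean
        self.log_std_head = nn.Linear(hidden, state_dim)  # state-dependent noise level

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        std = torch.exp(self.log_std_head(h).clamp(-5.0, 2.0))
        return mean, std

# Training minimizes the Gaussian negative log-likelihood, which lets the model
# attribute prediction error either to the mean or to a larger local noise level.
def nll_loss(model, state, action, next_state):
    mean, std = model(state, action)
    dist = torch.distributions.Normal(mean, std)
    return -dist.log_prob(next_state).sum(dim=-1).mean()
```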


In this study, we present a novel model-based exploration algorithm, which we call Stochastic Hallucinate Upper Confidence RL (SH-UCRL). Our algorithm is based on the UCRL scheme and relies on the principle of optimism in the face of uncertainty. Considering a stochastic and possibly heteroscedastic system, we estimate the system's behavior using an ensemble of Gaussian probability networks. We define a model set based on confidence estimates of state transitions and state uncertainty, and use a Model Predictive Control (MPC) optimizer to identify an optimal model and action sequence that promotes trajectories with high uncertainty and controls the agent's behavior. Our exploration can therefore be considered implicit, in the sense that it explores the model space rather than defining an explicit exploration bonus.
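The sketch below illustrates one way such an optimistic planning step could look: an ensemble of probabilistic models (for example, the hypothetical HeteroscedasticDynamics model above) defines a confidence set over dynamics, and a random-shooting MPC planner searches jointly over action sequences and auxiliary "hallucination" variables that pick the most favorable transition inside that set. This is a simplified, assumed reading of the scheme, not the thesis implementation; the function optimistic_mpc, its parameters (beta, horizon, n_candidates), and the reward interface are hypothetical.

```python
# Hedged sketch of an optimistic MPC planning step over a model confidence set.
# Names and parameters are illustrative, not taken from the thesis code.
import torch

def optimistic_mpc(ensemble, reward_fn, state, action_dim,
                   horizon=10, n_candidates=500, beta=1.0):
    with torch.no_grad():
        # Candidate action sequences in [-1, 1] and hallucination variables eta in [-1, 1].
        actions = torch.rand(n_candidates, horizon, action_dim) * 2 - 1
        etas = torch.rand(n_candidates, horizon, state.shape[-1]) * 2 - 1

        s = state.expand(n_candidates, -1)
        total_reward = torch.zeros(n_candidates)
        for t in range(horizon):
            a = actions[:, t]
            # Ensemble predictions: disagreement between the predicted means plus
            # the predicted noise gives a confidence interval for the next state.
            means, stds = zip(*[model(s, a) for model in ensemble])
            means, stds = torch.stack(means), torch.stack(stds)
            mu = means.mean(dim=0)
            sigma = means.std(dim=0) + stds.mean(dim=0)
            # The hallucination variable selects a next state inside the interval
            # [mu - beta * sigma, mu + beta * sigma]; candidates whose choices of
            # eta and actions yield high predicted return are preferred.
            s = mu + beta * sigma * etas[:, t]
            total_reward = total_reward + reward_fn(s, a)
        best = total_reward.argmax()
        return actions[best, 0]  # MPC: execute only the first action, then replan
```

Here reward_fn(s, a) is assumed to return one scalar per candidate trajectory at each step; the optimism enters through the hallucination variables, which let the planner exploit the full width of the confidence set rather than an explicit exploration bonus.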

Our analysis suggests that designing an exploration strategy that accounts for the stochastic characteristics of the model can lead to improved performance in heteroscedastic environments.