M.Sc. Thesis, Department of Electrical Engineering
Supervisor: Prof. Meir Ron
Full Thesis Text
In reinforcement learning, an agent uses online feedback from the environment together with prior knowledge in order to adaptively select an effective policy. Model-free approaches address this task by directly mapping external and internal states to actions, while model-based methods attempt to construct a model of the environment and then select optimal actions based on that model. Given the complementary advantages of the two approaches, we suggest a novel procedure that combines them into a single algorithm, which switches between a model-based and a model-free mode depending on the current environmental state and on the status of the agent's knowledge. Our method relies on a novel definition of a partially known Markov decision process, and on an estimator that incorporates such knowledge in order to reduce uncertainty in model-free algorithms of the stochastic approximation type. We prove that this approach leads to improved policy evaluation whenever sufficiently accurate environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach and suggest its efficacy in boosting policy gradient learning.
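To make the switching idea concrete, the following is a minimal illustrative sketch, not the thesis's actual algorithm: a hypothetical `hybrid_td0_evaluate` routine performs policy evaluation by applying a model-based Bellman backup in states where a partial model is available, and falling back to a standard model-free TD(0) sample update elsewhere. The function name, the `known` indicator, and the toy 5-state ring MDP are all assumptions introduced purely for illustration.

```python
import numpy as np

def hybrid_td0_evaluate(P_known, R_known, known, sample_step, gamma=0.9,
                        n_states=5, steps=2000, alpha=0.1, seed=0):
    """Policy evaluation mixing model-based and model-free updates.

    For states flagged in `known` (partial model available), the value is
    refreshed with a Bellman backup using the supplied transition model
    P_known and expected-reward model R_known; in all other states a
    standard TD(0) sample update is applied instead.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    s = 0
    for _ in range(steps):
        s_next, r = sample_step(s, rng)  # one environment sample
        if known[s]:
            # model-based backup: expected reward + discounted expected value
            V[s] = R_known[s] + gamma * P_known[s] @ V
        else:
            # model-free TD(0) update from the observed transition
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V

# Toy 5-state ring: move left or right with equal probability; reward 1
# is received on entering state 0. The model is "known" only in states 0-1.
P = np.zeros((5, 5))
for s in range(5):
    P[s, (s + 1) % 5] = P[s, (s - 1) % 5] = 0.5
R = P[:, 0].copy()  # expected one-step reward = prob. of landing in state 0
known = np.array([True, True, False, False, False])

def ring_step(s, rng):
    s2 = (s + 1) % 5 if rng.random() < 0.5 else (s - 1) % 5
    return s2, float(s2 == 0)

V = hybrid_td0_evaluate(P, R, known, ring_step)
print(V)
```

In this sketch, the mode switch is a fixed per-state flag; in the thesis the switch additionally depends on the status of the agent's accumulated knowledge.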