M.Sc. Thesis

M.Sc. Student: Peer Oren
Subject: Using Ensembles for Reducing the Estimation Bias in Q-Learning Algorithms
Department: Department of Electrical and Computer Engineering
Supervisor: Prof. Ron Meir


Reinforcement Learning is an area of machine learning in which an agent faces a sequential decision-making problem in some unknown or partially known environment. The agent interacts with the environment in order to learn a policy that maximizes some (generally unknown) reward signal.

Q-learning, a common reinforcement learning algorithm, suffers from an over-estimation bias of the Q-function due to the maximization term in the optimal Bellman operator used to update the Q-function estimator. This bias may lead to sub-optimal behavior in some scenarios, as it induces a non-uniform, misleading belief that certain states have high value. Double-Q-learning, a commonly used modification of Q-learning, tackles this issue by utilizing two estimators, yet it results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios the under-estimation bias may degrade performance.
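The two biases above can be seen in a small simulation. The setup below is an illustrative assumption, not an experiment from the thesis: five arms with true means [0, -0.2, -0.2, -0.2, -0.2], so the true maximal mean is 0. Taking the max of the sample means (as a Q-learning target does) is biased upward, while the double-estimator trick behind Double-Q-learning (argmax on one sample set, evaluation on an independent one) is biased downward:

```python
import random

random.seed(0)

# Assumed toy setup: 5 arms, true means below, so the true maximal mean is 0.
MEANS = [0.0, -0.2, -0.2, -0.2, -0.2]
N, TRIALS = 10, 20000  # samples per arm per set, Monte-Carlo repetitions

single_sum = double_sum = 0.0
for _ in range(TRIALS):
    # Two independent sample sets per arm, unit-variance Gaussian noise.
    mu_a = [sum(random.gauss(m, 1) for _ in range(N)) / N for m in MEANS]
    mu_b = [sum(random.gauss(m, 1) for _ in range(N)) / N for m in MEANS]
    single_sum += max(mu_a)                            # single-estimator target
    best = max(range(len(MEANS)), key=lambda a: mu_a[a])
    double_sum += mu_b[best]                           # double-estimator target

print(single_sum / TRIALS)  # positive: over-estimates the true max (0)
print(double_sum / TRIALS)  # negative: under-estimates it
```

The double estimator under-estimates here because the argmax taken on the first sample set sometimes selects a sub-optimal arm, whose true mean is below the maximum.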

In this work, we focus on the approximation errors of the next-state Q-values caused by estimating the maximal expectation in the optimal Bellman operator, and introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles.
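A minimal tabular sketch of an EBQL-style update, under our reading of the abstract: as in Double-Q-learning, the argmax action and its evaluation come from different estimators, but here the updated ensemble member supplies the argmax while the mean of the remaining members evaluates it. The environment sizes, ensemble size K, and the step-size and discount parameters below are illustrative assumptions:

```python
import random

random.seed(1)

# Illustrative assumptions: a tiny tabular setting and an ensemble of K members.
N_STATES, N_ACTIONS, K = 4, 2, 5
ALPHA, GAMMA = 0.1, 0.99

# K independent Q-tables, one per ensemble member, initialized to zero.
Q = [[[0.0] * N_ACTIONS for _ in range(N_STATES)] for _ in range(K)]

def ebql_update(s, a, r, s_next, k):
    """EBQL-style update of ensemble member k on transition (s, a, r, s_next)."""
    # The updated member picks the greedy next-state action ...
    a_star = max(range(N_ACTIONS), key=lambda b: Q[k][s_next][b])
    # ... and the average of all OTHER members evaluates it.
    peers = [Q[j][s_next][a_star] for j in range(K) if j != k]
    target = r + GAMMA * sum(peers) / len(peers)
    Q[k][s][a] += ALPHA * (target - Q[k][s][a])

# Usage: a few updates on random transitions, cycling through the members.
for t in range(100):
    s, a = random.randrange(N_STATES), random.randrange(N_ACTIONS)
    r, s_next = random.random(), random.randrange(N_STATES)
    ebql_update(s, a, r, s_next, k=t % K)

print(Q[0][0])  # learned Q-values of member 0 in state 0
```

Decoupling the argmax (one member) from the evaluation (the peer average) generalizes the two-estimator trick of Double-Q-learning; the peer average also has lower variance than any single estimator.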

We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield a lower mean-squared error when estimating the maximal mean of a set of independent random variables, compared to the Q-learning and Double-Q-learning estimation schemes. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. When averaging performance over these domains, EBQL performs robustly and outperforms the tabular versions of both Q-learning and Double-Q-learning. Finally, we demonstrate the superior performance and scalability of a deep-reinforcement-learning variant of EBQL over other algorithms on a high-dimensional, challenging benchmark: a suite of ATARI games.
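The mean-squared-error claim can be probed with a toy simulation under assumed conditions (equal true means, a fixed per-arm sample budget; this is not the thesis's formal setting). With all true means equal to 0, the QL-like estimator pays in bias, the DQL-like estimator pays in the variance of its half-sized evaluation set, and the EBQL-like estimator evaluates on a larger pooled subset:

```python
import random

random.seed(2)

# Assumed toy setup: K_ARMS arms with true mean 0 (true maximal mean = 0),
# N samples per arm, EBQL-style splitting into K_ENS equal subsets.
K_ARMS, N, K_ENS, TRIALS = 5, 30, 5, 20000

def mean(xs):
    return sum(xs) / len(xs)

ql_se = dql_se = ebql_se = 0.0
for _ in range(TRIALS):
    samples = [[random.gauss(0, 1) for _ in range(N)] for _ in range(K_ARMS)]

    # QL-like: max of the full-sample means.
    ql = max(mean(s) for s in samples)

    # DQL-like: argmax on one half, evaluate on the other half.
    half = N // 2
    a_star = max(range(K_ARMS), key=lambda a: mean(samples[a][:half]))
    dql = mean(samples[a_star][half:])

    # EBQL-like: argmax on one subset, evaluate on the pooled remainder.
    sub = N // K_ENS
    a_star = max(range(K_ARMS), key=lambda a: mean(samples[a][:sub]))
    ebql = mean(samples[a_star][sub:])

    # The true maximal mean is 0, so the error equals the estimate itself.
    ql_se += ql ** 2
    dql_se += dql ** 2
    ebql_se += ebql ** 2

print(ql_se / TRIALS, dql_se / TRIALS, ebql_se / TRIALS)
```

In this toy configuration the EBQL-like estimator attains the lowest mean-squared error of the three; the thesis's theoretical result is stated for the general case of independent random variables.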