| M.Sc Thesis | |
|---|---|
| M.Sc Student | Peer Oren |
| Subject | Using Ensembles for Reducing the Estimation Bias in Q-Learning Algorithms |
| Department | Department of Electrical and Computer Engineering |
| Supervisor | Prof. Ron Meir |
| Full Thesis text | |

Reinforcement Learning is an area of machine learning in which an agent faces a sequential decision-making problem in some unknown or partially known environment. The agent interacts with the environment in order to learn a policy that maximizes some (generally unknown) reward signal.

Q-learning, a common reinforcement learning algorithm, suffers from an over-estimation bias of the Q-function due to the maximization term in the optimal Bellman operator used to update the Q-function estimator. In some scenarios this bias may lead to sub-optimal behavior, as it induces a non-uniform, *misleading belief* of high state values. Double-Q-learning, a commonly used modification of Q-learning, tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios the under-estimation bias may degrade performance.
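The opposing biases of the two schemes can be reproduced with a small numerical sketch (an illustration of the general phenomenon, not an experiment from the thesis): when estimating the maximal mean of several noisy variables, a single estimator that both selects and evaluates the maximum over-estimates, while a Double-Q-style split of the samples under-estimates. The means, sample sizes, and noise level below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.linspace(0.0, 0.2, 10)   # true action values; the true maximum is 0.2
n, trials = 20, 10_000

single, double = [], []
for _ in range(trials):
    x = rng.normal(mus, 1.0, size=(n, mus.size))
    # Q-learning style: one estimator both selects and evaluates the maximum
    m = x.mean(axis=0)
    single.append(m.max())
    # Double-Q style: one independent half selects, the other evaluates
    a = x[: n // 2].mean(axis=0)
    b = x[n // 2:].mean(axis=0)
    double.append(b[a.argmax()])

bias_single = np.mean(single) - mus.max()   # positive: over-estimation
bias_double = np.mean(double) - mus.max()   # negative: under-estimation
print(f"single-estimator bias: {bias_single:+.3f}")
print(f"double-estimator bias: {bias_double:+.3f}")
```

The single estimator's bias is positive because noise that happens to inflate one sample mean is both what gets it selected and what gets reported; the split estimator removes that coupling but pays by sometimes evaluating a sub-optimal action.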

In this work, we focus on the approximation errors in the next-state Q-values caused by estimating the maximal expectation in the optimal Bellman operator, and introduce a new bias-reduced algorithm called ‘Ensemble Bootstrapped Q-Learning’ (EBQL), a natural extension of Double-Q-learning to ensembles.
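To make the "extension of Double-Q-learning to ensembles" concrete, here is a minimal tabular sketch of an EBQL-style update, under the assumption that one ensemble member selects the greedy next action while the average of the remaining members evaluates it; the function name and the surrounding agent details (exploration, which member to update) are illustrative, not taken from the thesis.

```python
import numpy as np

def ebql_update(Q, k, s, a, r, s_next, alpha, gamma, done=False):
    """One tabular EBQL-style update of ensemble member k (illustrative sketch).

    Q has shape (K, n_states, n_actions): K independent Q-tables.
    Member k picks the greedy next action; the mean of the *other*
    members evaluates it, decoupling action selection from evaluation
    (the Double-Q idea, extended to K estimators).
    """
    K = Q.shape[0]
    if done:
        target = r
    else:
        a_star = int(Q[k, s_next].argmax())                    # selection: member k
        others = [j for j in range(K) if j != k]
        target = r + gamma * Q[others, s_next, a_star].mean()  # evaluation: the rest
    Q[k, s, a] += alpha * (target - Q[k, s, a])
```

With K = 2 this reduces to the familiar Double-Q-learning update; for K > 2 the evaluation averages over K − 1 estimators, which reduces the variance of the target.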

We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield a lower mean-squared error when estimating the maximal mean of a set of independent random variables, compared to the Q-learning and Double-Q-learning estimation schemes. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. When averaging performance over these domains, EBQL performs robustly and outperforms the tabular versions of both Q-learning and Double-Q-learning. Finally, we demonstrate the superior performance and scalability of a deep-reinforcement-learning variant of EBQL over other algorithms on a challenging high-dimensional benchmark: a suite of Atari games.
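The flavor of the mean-squared-error claim can be checked numerically in a toy setting (this is not the thesis's proof or its exact setting; sample sizes, the ensemble size K, and the choice of equal means are assumptions made for illustration). With all true means equal to zero, each scheme's squared error is simply the square of its estimate of the maximal mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, n, trials, K = 10, 30, 20_000, 5
mus = np.zeros(n_actions)          # equal means: the maximal mean is exactly 0

ql, dql, ebql = [], [], []
for _ in range(trials):
    x = rng.normal(mus, 1.0, size=(n, n_actions))
    # Q-learning scheme: a single estimator selects and evaluates
    ql.append(x.mean(axis=0).max() ** 2)
    # Double-Q scheme: two independent halves, one selects, one evaluates
    a = x[: n // 2].mean(axis=0)
    b = x[n // 2:].mean(axis=0)
    dql.append(b[a.argmax()] ** 2)
    # EBQL-like scheme: K splits; member 0 selects, the rest evaluate
    means = np.stack([p.mean(axis=0) for p in np.array_split(x, K)])
    ebql.append(means[1:, means[0].argmax()].mean() ** 2)

mse_ql, mse_dql, mse_ebql = map(np.mean, (ql, dql, ebql))
print(f"MSE  QL: {mse_ql:.4f}  Double-QL: {mse_dql:.4f}  EBQL-like: {mse_ebql:.4f}")
```

In this toy setting the single estimator pays a squared over-estimation bias, the double estimator is unbiased but evaluates with only half the samples, and the ensemble-style estimator is unbiased while evaluating with (K − 1)/K of the samples, which is where its lower mean-squared error comes from.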