M.Sc Thesis

M.Sc StudentAdam Guy
SubjectApproximate Thompson Sampling for Improved Exploration in
Deep Reinforcement Learning
DepartmentDepartment of Electrical and Computer Engineering
Supervisor PROF. Nahum Shimkin
Full Thesis textFull thesis text - English Version


Deep Reinforcement Learning (DRL) is an active area of research within the field of machine learning. The combination of reinforcement learning algorithms with the representation capabilities of Deep Neural Networks(DNN) has led to unprecedented performance across a wide variety of control tasks, ranging from video games to robotics.

A significant milestone in this field was the Deep Q-Network (DQN) algorithm. The DQN algorithm generally operates as follows: An agent interacts with an unfamiliar environment through its exploration strategy and stores its experience in a buffer. Then, learning algorithm uses the experience in the buffer to estimate the optimal Q-function through a DNN, which is phrased the Q-network. The Q-function being the state-action value function, which represents the expected cumulative reward an agent can get from taking action a while being in state s and then follow the optimal policy.

The learning algorithm is therefore highly dependent on the experience the agent gains from interacting with the environment, which depends on the agent’s exploration strategy. Despite this observation, DQN uses a simple, yet common, exploration strategy named ε-greedy. This strategy selects the optimal action according to current Q-estimate with probability 1 − ε, thus exploiting the knowledge gained by the agent so far. With probability ε, it takes a random action, which will hopefully lead to new informative experience. This strategy is inefficient both empirically and theoretically. Its main problem is that it explores the environment randomly, without using the observed data to improve exploration.

Thompson Sampling (TS) is a general algorithm for online decision problems, with many implementations for specific tasks. This algorithm is designed to select actions sequentially in a manner that balances exploration and exploitation.

In this thesis, we suggest a different approach for exploration in the DQN algorithm inspired by Thompson Sampling. We use the state representation learned by the Q-network, to estimate both Q-values and fitting confidence sets. We use these confidence sets to balance exploration and exploitation in a similar manner to TS. As the representation learned by the Q-network is continually changing, we suggest a method to transfer information between old and new representations. We demonstrate our method on a toy problem as well as on five Atari benchmarks. We show its superiority over the DQN algorithm in speed and performance, as well as competitive results with the recent RAINBOW algorithm which incorporates several DQN extensions.