M.Sc Student: Cohen Snir
Department: Department of Computer Science
Supervisor: Prof. Shie Mannor
Markov Decision Processes (MDPs) are commonly used to model Reinforcement Learning problems in which an agent interacts with an unknown environment, aiming to maximize a given criterion. In many real-world applications, a decision maker's task is to formulate a strategy for a fixed number of steps in order to achieve a certain objective. For example, in adaptive routing, packets must be sent through a network of routers before the maximum number of hops is reached and sending fails, while keeping latency to a minimum. Such problems are referred to as finite-horizon (or fixed-horizon) problems. Further examples arise in portfolio management, power systems management, and games.
Optimistic methods for solving Reinforcement Learning problems are very popular in the literature. In practice, however, they often perform worse than alternatives such as Posterior Sampling. We propose a novel concept of Restricted Optimism to balance the well-known exploration vs. exploitation trade-off in finite-horizon MDPs. We harness Posterior Sampling to construct two algorithms in the spirit of our Restricted Optimism principle. We provide theoretical guarantees for both and demonstrate through experiments that there exists a trade-off between the average cumulative regret suffered by the agent and its variance. The agent can influence this trade-off by tuning the level of optimism exercised by our proposed algorithms through a regularization parameter.
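To make the Posterior Sampling ingredient concrete, the following is a minimal sketch of one episode of posterior sampling for a tabular finite-horizon MDP. It is an illustration only, not the thesis's algorithm: it assumes a Dirichlet posterior over transitions, known mean rewards, and omits the Restricted Optimism regularization; the function name and array shapes are hypothetical.

```python
import numpy as np

def psrl_episode(counts, rewards, S, A, H, rng):
    """One episode of posterior sampling (PSRL) for a finite-horizon MDP.

    counts[s, a, s'] : observed transition counts (Dirichlet posterior parameters)
    rewards[s, a]    : mean rewards, assumed known here for simplicity
    Returns a non-stationary greedy policy pi[h, s] for the sampled MDP.
    """
    # Sample one transition model from the Dirichlet posterior
    # (a Dirichlet(1, ..., 1) prior is added to the counts).
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + 1.0)

    # Solve the sampled MDP exactly by backward induction over the horizon.
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=np.int64)
    for h in reversed(range(H)):
        Q = rewards + P @ V          # Q[s, a] = r(s, a) + E_{s'}[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi
```

The agent would then execute `pi` for one episode, update `counts` with the observed transitions, and resample; acting greedily with respect to a model drawn from the posterior is what drives exploration in this family of methods.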