|Ph.D Student||Feldman Zohar|
|Subject||Monte-Carlo Algorithms for Online Action Planning in|
Markov Decision Processes
|Department||Department of Industrial Engineering and Management||Supervisor||Professor Carmel Domshlak|
|Full Thesis text|
Markov decision processes (MDPs) is the most popular mathematical framework for modeling problems of sequential decision making. The focus of our work here is on online planning for MDPs: When planning online, the agent is focusing on
its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. Formally, the performance of algorithms for online planning is assessed in terms of simple regret, the agent's expected performance loss when the chosen action, rather than the optimal one, is followed.
State-of-art algorithms for online planning in general MDPs are either best effort, or guarantee only polynomial-rate reduction of simple regret over time. In this work, we introduce a new Monte-Carlo tree search (MCTS) algorithm, BRUE, that guarantees exponential-rate and smooth reduction of simple regret. At a high level, BRUE is based on a novel state-space sampling scheme that implements a concept of ``separation of concerns'' according to which the strategies that determine which estimates to update within each sample and the strategies that determine how to update them are independent.
Although BRUE is very effective in practice, unlike other state-of-art algorithms such as UCT, it is not geared towards identifying reasonably good solutions right at the go. We take a stand on the individual strengths of these two classes of algorithms, and show how they can be effectively connected. We then rationalize a principle of ``selective tree expansion", and suggest a concrete implementation of this principle within MCTS.
Finally, we formally compare BRUE with MAXBRUE, a variant of BRUE that employs the principal of dynamic programming (Bellman) to estimate the actions values rather than averaging samples. We further reveal relative pros and cons of the two update methods that get hidden by the worst-case bounds.