Ph.D. Thesis, Department of Electrical Engineering
Supervisor: Prof. Meir Ron
Actor-critic approaches were among the first algorithms to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. Interestingly, there is growing evidence that actor-critic mechanisms based on phasic dopamine signals play a key role in biological learning through cortico-basal ganglia loops.
In this work we investigate actor-critic algorithms from several perspectives, ranging from machine learning to biology. The thesis is divided into three main parts, followed by an appendix. In the first part we show how actor-critic algorithms may explain synaptic mechanisms observed in biological neural networks. The suggested model connects spike-timing-dependent plasticity (STDP), which governs changes in synaptic efficacy, with the dopamine neuromodulator, which is believed to encode the so-called temporal difference (TD) error signal. We show that such an algorithm possesses a solid theoretical basis and demonstrate its capabilities in simulations.
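To make the central quantity concrete, the TD error mentioned above is the discrepancy between the predicted value of a state and the reward-corrected value of its successor. The following is a minimal illustrative sketch, not the thesis's actual model; the function name, the scalar state values, and the numeric example are assumptions made for exposition.

```python
def td_error(r, v_s, v_next, gamma=0.9):
    """Temporal difference error: delta = r + gamma * V(s') - V(s).

    A positive delta signals an outcome better than predicted; in the
    dopamine hypothesis this corresponds to a phasic dopamine burst.
    """
    return r + gamma * v_next - v_s

# An unexpected unit reward with flat value predictions gives a
# positive TD error.
delta = td_error(r=1.0, v_s=0.0, v_next=0.0)
print(delta)  # prints 1.0
```

In actor-critic schemes this single scalar drives both the critic's value update and the actor's policy update, which is what makes it a plausible candidate for a broadcast neuromodulatory signal.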
In the second part, we show that, contrary to previous belief, actor-critic algorithms may operate on a single time scale. We show that actor-critic algorithms with function approximation converge under a wide range of conditions, e.g., with a critic that carries out a full TD($\lambda$) algorithm. Convergence of this type is possible under the requirement that the actor's policy parameters converge to a controlled neighborhood of the optimal value rather than to the optimal value itself.
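A TD($\lambda$) critic of the kind referred to above can be sketched as follows for a linear function approximator. This is a generic textbook-style update with an accumulating eligibility trace, shown under assumed step-size and discount parameters; it is not the specific algorithm analyzed in the thesis.

```python
import numpy as np

def td_lambda_step(w, z, phi_s, phi_next, r,
                   alpha=0.05, gamma=0.95, lam=0.8):
    """One TD(lambda) update for a linear critic V(s) = w @ phi(s).

    w        critic weight vector
    z        eligibility trace vector
    phi_s    feature vector of the current state
    phi_next feature vector of the next state
    r        reward received on the transition
    Returns the updated (w, z).
    """
    delta = r + gamma * w @ phi_next - w @ phi_s  # TD error
    z = gamma * lam * z + phi_s                   # accumulating trace
    w = w + alpha * delta * z                     # gradient-like update
    return w, z

# One illustrative step: zero initial values, unit reward.
w, z = np.zeros(2), np.zeros(2)
w, z = td_lambda_step(w, z, np.array([1.0, 0.0]), np.array([0.0, 1.0]), r=1.0)
```

The trace $z$ spreads each TD error backward over recently visited features, which is what distinguishes TD($\lambda$) from the one-step TD(0) special case ($\lambda = 0$).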
In the last part we present a class of actor-critic algorithms in which the representation basis adapts to the problem at hand, thereby achieving better performance than non-adaptive algorithms. We demonstrate the applicability of these algorithms under several fitness criteria. These algorithms also illustrate the broader concept of adaptive bases, which improve performance on the one hand while still admitting a relatively simple analysis on the other. The power of these algorithms stems from augmenting the original algorithm with an adaptive mechanism. In addition, we present variants of adaptive Q-learning with function approximation and adaptive optimal stopping time algorithms. Since such algorithms involve non-differentiable operators, we apply methods that use a soft approximation of these operators.
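One standard way to softly approximate the non-differentiable max operator, as alluded to above, is the log-sum-exp ("soft max") function. The sketch below is a generic example of this technique under an assumed inverse-temperature parameter `beta`; the thesis may use a different smoothing, so treat this only as an illustration of the idea.

```python
import numpy as np

def soft_max(q, beta=10.0):
    """Smooth approximation of max(q):  (1/beta) * log(sum(exp(beta*q))).

    Differentiable everywhere, and converges to max(q) as beta -> infinity.
    The largest exponent is subtracted first for numerical stability.
    """
    b = beta * np.asarray(q, dtype=float)
    m = b.max()
    return (m + np.log(np.exp(b - m).sum())) / beta

# With a large beta the approximation is already very close to the max.
value = soft_max([1.0, 2.0], beta=50.0)  # approximately 2.0
```

Replacing the hard max in a Q-learning or optimal-stopping update with such an operator yields a differentiable objective, at the cost of a small, controllable bias that shrinks as `beta` grows.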
In the appendix we summarize biological experimental work inspired by actor-critic algorithms, in which reinforcement learning protocols are applied to cultured neurons grown in vitro. We discuss the experimental results and offer suggestions for future work.