Technion - Israel Institute of Technology
Graduate School

M.Sc. Thesis

M.Sc. Student: Baras Dorit
Subject: Direct Policy Search in Reinforcement Learning and Synaptic Plasticity in Biological Neural Networks
Department: Department of Electrical Engineering
Supervisor: Professor Ron Meir


Abstract

Learning performed by natural or artificial agents is characterized by changes in behavior (or competence) resulting from environmental feedback. For such a change of behavior to take place, some of the internal parameters characterizing the agent must be modified. This study addresses learning in networks of spiking neurons.

The theoretical framework for our study is that of partially observable Markov decision processes (POMDPs). Consider an agent acting in an environment and modifying its performance based on a reward signal related to its actions. The agent-environment pair is characterized by a state, which changes every time the agent performs an action. Moreover, the agent cannot access the state explicitly; it has access only to a noisy observation derived from the state. The actions performed by the agent are stochastic in nature, depending on internal parameters which define the agent's input-output behavior. In particular, in the context of neural networks, the internal parameters may be the synaptic efficacies as well as any other modifiable neuronal parameters.
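To make the setting concrete, the average-reward POMDP objective can be written schematically as follows (the notation here is illustrative and is not quoted from the thesis): the state evolves as s_{t+1} \sim P(\cdot \mid s_t, a_t), the agent receives an observation o_t \sim \nu(\cdot \mid s_t) and emits an action a_t \sim \pi_\theta(\cdot \mid o_t), where \theta collects the modifiable internal parameters (e.g., the synaptic efficacies). The quantity to be maximized is the expected average reward

    \eta(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\theta\!\left[ \sum_{t=1}^{T} r(s_t, a_t) \right],

and policy gradient methods estimate \nabla_\theta \eta(\theta) directly from sampled interaction with the environment.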

A great deal of recent work within the machine learning literature has been devoted to finding effective algorithms which enable an agent to improve its performance in such an environment, in the sense of maximizing its expected average reward. In this work we focus on approaches based on policy gradient estimation, implemented in networks of spiking neuron models. We show that for a large class of biologically plausible neuronal models, a synaptic plasticity rule can be derived which is essentially Hebbian in nature and strongly resembles the spike-timing-dependent plasticity (STDP) rules studied widely over the past few years. While STDP rules are generally used in an unsupervised context, the present study shows how to combine local synaptic plasticity rules with a global reward signal in order to improve performance in a reinforcement learning task. Finally, we relate the derived rule to other self-regulating plasticity rules, and demonstrate its effectiveness in a series of simulation studies.
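As an illustration of how a local, Hebbian-like eligibility trace can be gated by a global reward signal, the following minimal Python sketch implements such a three-factor update for a single stochastic (escape-noise) neuron. All names, parameter values, and the toy reward used here are illustrative assumptions; this is not the simulation code or the exact plasticity rule derived in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative dimensions and constants (assumed, not from the thesis).
    n_inputs, n_steps = 20, 1000
    alpha = 0.01          # learning rate
    tau_e = 20.0          # eligibility-trace decay constant (in time steps)

    w = rng.normal(0.0, 0.1, n_inputs)   # synaptic efficacies (the policy parameters)
    e = np.zeros(n_inputs)               # per-synapse eligibility traces
    r_bar = 0.0                          # running baseline of the reward

    for t in range(n_steps):
        x = (rng.random(n_inputs) < 0.05).astype(float)  # presynaptic spikes (Bernoulli)
        u = w @ x                                        # membrane drive
        p_spike = 1.0 / (1.0 + np.exp(-u))               # stochastic firing probability
        y = float(rng.random() < p_spike)                # postsynaptic spike (0 or 1)

        # Hebbian-like eligibility: correlation of the presynaptic input with the
        # deviation of the postsynaptic spike from its expected value; this is the
        # score function (log-likelihood gradient) of the neuron's output.
        e = (1.0 - 1.0 / tau_e) * e + (y - p_spike) * x

        # Global reward signal; here a toy task that simply rewards postsynaptic spikes.
        r = y
        r_bar = 0.99 * r_bar + 0.01 * r

        # Three-factor update: local eligibility gated by reward minus a baseline.
        w += alpha * (r - r_bar) * e

Because the factor (y - p_spike) * x is the gradient of the log-probability of the neuron's output, averaging this update over trajectories yields an estimate of the policy gradient; the running baseline r_bar serves only to reduce the variance of that estimate.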