Ph.D Thesis

Ph.D StudentAvner Orly
SubjectMachine Learning Inspired by Multi-User Communication
DepartmentDepartment of Electrical and Computer Engineering
Supervisor PROF. Shie Mannor
Full Thesis textFull thesis text - English Version


The motivation for the work presented in this thesis comes from the world of cognitive radio networks.  Cognitive radio networks are a collection of challenging problems in the world of communications that concern multi-user, multimedia communication networks, and are of interest due to their complex dynamic nature and stochasticity.  In my research I apply ideas and approaches  from  machine  learning,  in  order  to  model  and  solve  cognitive radio network-oriented problems.

My research focuses on decision theoretic aspects of cognitive radio networks.   The   first  setting  consists  of  a  single  user  acting  in  an  unfamiliar environment,  trying to learn the best transmission profile while maintaining an operational constraint (e.g., average power consumption below some threshold).  I model the problem using a stochastic multi-armed bandit with a pathwise constraint on the acquired penalty. My solution is an index-based steering policy that maximizes the average reward while steering the average penalty into a set representing the constraint.  I provide performance guarantees for both penalty and reward, and demonstrate results through simulations.

The second part of my work examines decoupling exploration and exploitation.  The user, whose goal is to maximize the acquired reward, may choose two arms in each round:  one for exploration, and the other for exploitation.  Thus, the user actually acquires reward from the exploited arm, but observes the reward of the explored arm.  The paper offers a thorough theoretical analysis, showing considerable improvement over existing algorithms when there is a small subset of attractive arms or when the identity of the best arm changes over time.  A simulation comparing the algorithm to other approaches complements the analysis and demonstrates the results.

In the following section I turn to the multi-user setting.  Channels are still represented by the arms of a stochastic multi-armed bandit, but they are targeted by several users, competing over the same resources.  To maximize their  reward  users  must  learn  the  characteristics  of  the  different  channels while minimizing the number of collisions.  Users are not allowed to communicate directly and must manage without any form of cooperation, coordination or central control.  I propose an algorithm that combines epsilon-greedy learning with a collision avoidance mechanism. I analyze its regret with respect to the system-wide optimum and show that sub-linear regret can be obtained.   My  main  innovation  is  the  fact  that  the  users'  policy  does  not require knowledge of the number of users in the network.  As a result, my algorithm can also handle scenarios with a variable number of users. My last project extends the multi-user setting to the challenging scenario in which the characteristics of each channel are different for each user.  As before, information exchange between the users must be limited to a minimum.   Achieving  the  optimum  in  terms  of  regret  is  not  possible  without explicit communication, and instead I focus on reaching a stable marriage configuration.   I develop an algorithm for learning a stable configuration, including an extension for a variable number of users.  I further offer both convergence guarantees and experiments, including comparison to state-of-the-art algorithms.