טכניון מכון טכנולוגי לישראל
הטכניון מכון טכנולוגי לישראל - בית הספר ללימודי מוסמכים  
M.Sc Thesis
M.Sc StudentIlya Kravchik
SubjectLearning Drifting Data Using Selective Sampling
DepartmentDepartment of Electrical Engineering
Supervisor Professor Crammer Yacov
Full Thesis textFull thesis text - English Version


Abstract

Online learning is a well known sequential learning setting, providing algorithms which perform successfully for a variety of problems.  In some on-line learning settings, the optimal function from a family of functions learned, changes over time, either by gradually drifting or by an abrupt switch. This occurs due to changes in data distribution or other conditions in real-life problems with which on-line learning algorithms deal, such as information filtering, sentiment analysis, market analysis and more

.

In the selective sampling approach to on-line learning settings, not all labels or values of input instances are queried, while algorithm performance remains unharmed. Selective sampling suggests a concept of confidence: on instances algorithm is confident about, prediction should be close to optimal and true value can remain unknown (and thus no query is needed).


In this work it is proposed to use the concept of confidence on instances, which comes from selective sampling, to overcome drifting data setting in on-line learning problem. This research focuses on the case of a sudden, abrupt switch. We use dummy estimators, constructed from a recent time window to evaluate algorithm performance. If the estimator of the algorithm differs significantly from dummy estimator on instances it has high confidence about, we assume that it is due a switch, and the algorithm is restarted to avoid irrelevant and incorrect former data. Moreover, besides a scheme for detecting switches, the algorithm provides assurance that no false detections occur and assures that if a switch remains undetected then it causes only a small harm.


Analysis for on-line learning linear classification and linear regression settings is presented, followed by results achieved in simulations on synthetic data, which support developed theoretical results.