Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Gabel Moshe
Subject: Unsupervised Anomaly Detection in Large Datacenters
Department: Department of Computer Science
Supervisors: Professor Assaf Schuster, Dr. Ran Gilad-Bachrach


Abstract

Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management.

Complex online services run on top of datacenters that often contain thousands of machines. With so many machines, failures are common, and automatic monitoring is essential.

Many existing failure detection techniques do not adapt well to the unpredictable and dynamic environment of large-scale online services. They rely on static rules, obsolete historical logs, or costly (often unavailable) training data. More flexible techniques are impractical, as they require deep domain knowledge, unavailable console logs, or intrusive service modifications.

We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments on large real-world services, in which over 20% of machine failures were preceded by such latent faults.

We propose a proactive approach to failure prevention by detecting performance anomalies without prior knowledge about the monitored service. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. The main assumption in our framework is that at any point in time, most machines function well. By comparing machines to each other, we can then find those machines that exhibit latent faults.
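
The idea of comparing machines to each other can be illustrated with a minimal sketch. It is not one of the three detection methods of the thesis: it assumes a hypothetical data layout (`counters[t][m]` holds the counter values of machine `m` at time `t`), uses a simple median-based distance as the comparison, and flags machines with an illustrative threshold `factor`, whereas the thesis derives tests that need no tuning. The function names are likewise assumptions made for this example.

```python
from statistics import median

def latent_fault_scores(counters):
    """Score each machine by its average distance from the 'typical' machine.

    counters[t][m] is the list of counter values of machine m at time t
    (hypothetical layout chosen for this sketch).
    """
    num_machines = len(counters[0])
    totals = [0.0] * num_machines
    for snapshot in counters:
        num_counters = len(snapshot[0])
        # Per-counter median across machines: most machines are assumed healthy,
        # so the median approximates normal behavior at this point in time.
        med = [median(m[c] for m in snapshot) for c in range(num_counters)]
        # Median absolute deviation as a robust per-counter scale (1.0 if zero).
        mad = [median(abs(m[c] - med[c]) for m in snapshot) or 1.0
               for c in range(num_counters)]
        for i, m in enumerate(snapshot):
            totals[i] += sum(abs(m[c] - med[c]) / mad[c]
                             for c in range(num_counters)) / num_counters
    # Averaging over time favors persistent deviations over momentary spikes.
    return [total / len(counters) for total in totals]

def flag_outliers(scores, factor=3.0):
    """Flag machines scoring above factor times the median score.

    The threshold is illustrative only, not a test from the thesis.
    """
    typical = median(scores)
    return [i for i, s in enumerate(scores) if s > factor * typical]

# Five machines, two counters, three identical snapshots; machine 3 deviates.
data = [[[10, 100], [11, 101], [9, 99], [30, 160], [10, 100]]] * 3
scores = latent_fault_scores(data)
print(flag_outliers(scores))  # [3]
```

Because the reference point is the median machine at each time step, a load change that affects all machines alike shifts the reference with it and does not by itself raise any machine's score.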

We demonstrate three detection methods within the framework, and apply them to several real-world production services. The derived tests are domain-independent and unsupervised, require neither background information nor parameter tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests, and show how they hold in practice.