M.Sc Thesis


M.Sc Student: Margulis Igor
Subject: On Anomaly Detection in Tabular Data
Department: Department of Computer Science
Supervisors: Prof. Ran El-Yaniv, Associate Prof. Yuval Filmus


Abstract

We consider the problem of anomaly detection in tabular data and present a modular
framework for anomaly detection based on classification of self-labelled data. Given a
set of records, all considered as belonging to a “normal” class (e.g., measurements corresponding
to some physical phenomenon of interest), we demonstrate how a deep neural
model appropriate for the classification of tabular data can be incorporated into the
detection scheme for sorting out anomalous records (e.g., measurements corresponding
to some background signal).

Tables are a very popular way of presenting data, so anomaly detection in
tabular data is of clear practical importance. The task is challenging
because of the heterogeneity of the data, which spans many facets of the real-world phenomena
captured by measurements and arranged in the form of tables.

The standard and intuitive approach to the problem of anomaly detection is to learn
a model of normality. Having acquired an understanding of normal patterns, the
system can track down non-conforming patterns and declare them anomalies.

Classic approaches to anomaly detection usually do not perform
well on high-dimensional data, which is common for tabular data in many
applications. For example, medical records of patients can include hundreds of measured parameters
from blood analysis, immune system status, genetic background, nutrition, alcohol
and tobacco consumption, treatments, and diagnosed diseases. To circumvent this issue,
many recent approaches employ some mechanism for dimensionality reduction of the
data and apply anomaly detection techniques in the low-dimensional representation
space.

In contrast to these methods, the main idea behind the classification-based scheme
presented in this thesis is to train a multiclass classifier to distinguish between several
dozen transformations applied to all the given “normal” records.
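
The construction of the self-labelled training set can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the transformation family, so the sketch uses random affine maps, and the function name `make_self_labelled` and its parameters are hypothetical. Each record is transformed once per class, and the index of the transformation serves as the class label for the multiclass classifier.

```python
import numpy as np

def make_self_labelled(X, n_transforms=32, seed=0):
    """Build a self-labelled dataset from "normal" records X of shape (n, d).

    Assumption: transformations are random affine maps x -> x @ W_k + b_k.
    The thesis may use a different transformation family.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # One random weight matrix and bias vector per transformation (class).
    W = rng.normal(size=(n_transforms, d, d))
    b = rng.normal(size=(n_transforms, 1, d))
    # XT[k] holds every record after transformation k: shape (K, n, d).
    XT = np.einsum("nd,kde->kne", X, W) + b
    # The label of a transformed record is the index of its transformation.
    labels = np.repeat(np.arange(n_transforms), n)
    return XT.reshape(-1, d), labels, (W, b)
```

A classifier trained on the returned pairs learns to recognise which transformation was applied; the stored `(W, b)` are needed again at test time to transform new records in the same way.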

The data representation learned by the model turns out to be useful for identifying
anomalous records at test time, based on either (i) statistics of the softmax activations
of the classification model when applied to transformed records (the softmax output
represents the probability that an input record belongs to each class), or
(ii) distance-based statistics calculated for the produced representation of an input record
with respect to the cluster centers of the learned representations. To validate our solution,
we present experiments using the proposed framework.
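
The two scoring options described above can be illustrated with a minimal sketch. The exact statistics used in the thesis are not given in the abstract, so the choices here are assumptions: the softmax-based score averages the probability assigned to the correct transformation, and the distance-based score averages the Euclidean distance between each transformed record's representation and the center of its transformation's cluster. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_anomaly_score(logits):
    """logits: (K, K); row k is the classifier output for the record
    under transformation k. Higher score means more anomalous.

    Assumed statistic: one minus the mean probability of the true class.
    """
    p = softmax(logits)
    return float(1.0 - np.mean(np.diag(p)))

def distance_anomaly_score(reps, centers):
    """reps, centers: (K, r). reps[k] is the learned representation of the
    record under transformation k; centers[k] is the center of class k.

    Assumed statistic: mean Euclidean distance to the matching center.
    """
    return float(np.mean(np.linalg.norm(reps - centers, axis=1)))
```

Under both scores, a record whose transformed versions are classified confidently, or whose representations fall close to the learned cluster centers, receives a low score and is treated as normal.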