M.Sc Thesis |
---|---
M.Sc Student | Margulis Igor
Subject | On Anomaly Detection in Tabular Data
Department | Department of Computer Science
Supervisors | Prof. Ran El-Yaniv, Associate Prof. Yuval Filmus
We consider the problem of anomaly detection in tabular data and present a modular
framework for anomaly detection based on classification of self-labelled data. Given a
set of records, all considered to belong to a "normal" class (e.g., measurements corresponding
to some physical phenomenon of interest), we demonstrate how a deep neural
model suited to the classification of tabular data can be incorporated into the
detection scheme to sort out anomalous records (e.g., measurements corresponding
to some background signal).
Tables are a very common way of organizing data, so anomaly detection in
tabular data is of great practical importance. The task is challenging
due to the heterogeneity of such data, which spans diverse facets of real-world phenomena
captured by measurements and arranged in the form of tables.
The standard and intuitive approach to anomaly detection is to learn
a model of normality. Having acquired an understanding of normal patterns, the
system can flag non-conforming patterns as anomalies.
Classic approaches to anomaly detection usually do not perform
well on high-dimensional data, which tabular data often is in many
applications; for example, a patient's medical record can include hundreds of measured parameters
covering blood analysis, immune system status, genetic background, nutrition, alcohol
and tobacco consumption, treatments, and diagnosed diseases. To circumvent this issue,
many recent approaches employ some mechanism for dimensionality reduction of the
data and then apply anomaly detection techniques in the resulting low-dimensional representation
space.
In contrast to these methods, the main idea behind the classification-based scheme
presented in this thesis is to train a multiclass classifier to distinguish between several
dozen transformations applied to all the given "normal" records.
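A rough Python sketch of this self-labelling step is given below. The synthetic records, the choice of random affine transformations, and the scikit-learn MLP are all assumptions standing in for the thesis's actual data, transformation family, and deep tabular model; the sketch only illustrates how transformed copies of normal records become a multiclass classification problem.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical "normal" records (n_records x n_features); in practice these
# come from the application at hand.
X_normal = rng.normal(size=(1000, 16))

# One plausible choice of transformations (an assumption, not necessarily the
# construction used in the thesis): K random affine maps of the feature space.
K = 32  # "several dozen" transformations
transforms = [(rng.normal(size=(16, 16)), rng.normal(size=16)) for _ in range(K)]

# Self-labelling: apply every transformation to every record and label each
# transformed record with the index of the transformation that produced it.
X_selflabelled = np.vstack([X_normal @ M + b for M, b in transforms])
y_selflabelled = np.repeat(np.arange(K), len(X_normal))

# Stand-in classifier; the thesis uses a deep neural model suited to tabular
# data, for which this small MLP is only a placeholder.
clf = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=100)
clf.fit(X_selflabelled, y_selflabelled)
```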
The data representation learned by the model turns out to be useful for identifying
anomalous records at test time, based either on statistics of the softmax activations (i.e., the output
of the classifier, which represents the probability that an input record belongs to each
class) obtained when the model is applied to transformed records, or on
distance-based statistics computed between the representation produced for an input record
and the cluster centers of the learned representations. To validate our solution,
we present experiments using the proposed framework.
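The sketch below, which continues the one above (reusing `clf` and `transforms`), illustrates one possible softmax-based score: each test record is passed through all transformations, and the average probability assigned to the correct transformation index serves as a normality score. This is only an assumed, simplified scoring rule; the exact statistics used in the thesis may differ, and the distance-based variant (distances to cluster centers in representation space) is not shown.

```python
import numpy as np

def softmax_anomaly_score(clf, transforms, x):
    """Score a single record x; higher means more anomalous.

    For each transformation k, x is transformed and classified. A normal
    record should be confidently assigned to class k, so a low average
    probability of the "true" transformation suggests an anomaly.
    """
    probs = []
    for k, (M, b) in enumerate(transforms):
        p = clf.predict_proba((x @ M + b).reshape(1, -1))[0]
        probs.append(p[k])  # labels are 0..K-1, so column k is transformation k
    return -float(np.mean(probs))

# Usage on a hypothetical held-out record.
x_test = np.random.default_rng(1).normal(size=16)
print(softmax_anomaly_score(clf, transforms, x_test))
```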