טכניון מכון טכנולוגי לישראל
הטכניון מכון טכנולוגי לישראל - בית הספר ללימודי מוסמכים  
M.Sc Thesis
M.Sc StudentBar-Or Amir
SubjectHierarchical Decision Tree Induction for Highly Dimensional
Data in Large-Scale Distributed Systems
DepartmentDepartment of Computer Science
Supervisor Professor Assaf Schuster


Abstract

Classification based on decision trees is one of the important problems in data mining and has applications in many fields, including fraud detection, bioinformatics, and retail.

Classical data mining is usually conducted in a data warehouse, where data is collected into a single system and processed by various decision support systems.

In recent years, database systems have become highly distributed, and distributed system paradigms such as federated and peer-to-peer databases are being adopted.

In this thesis, we consider the problem of inducing decision trees in a large distributed network for highly dimensional distributed databases.

Our work is motivated  by the existence of highly dimensional datasets in many domains --- medical information and bioinformatics, for example.

Current decision tree algorithms require high communication bandwidth, which is not likely to exist in large-scale distributed systems.

We present a novel algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, which is nevertheless sufficient to derive the exact same decision tree learned by a sequential learner on all the data in the network.

The reduction in communication is due to the fact that statistical data is only collected for attributes that are considered promising candidates for classification.

An interesting example for such a distributed system with highly dimensional data exists in the domain of medical informatics, where medical databases contain genomic and clinical data.

The algorithm is tested on a simulation of this environment and shows a reduction of more than two orders of magnitude in the number of sent bytes, with only a small increase in the number of messages.