Ph.D Thesis | |
Ph.D Student | Wolff Ran |
---|---|
Subject | Data Mining in Large-Scale Distributed Systems |
Department | Department of Computer Science | Supervisor | PROF. Assaf Schuster |
The economic value of data mining is today well established. One of the most popular data
mining techniques is association rule mining (ARM), which is the automatic discovery of
pairs of element sets that tend to appear together in a common context. An example would
be to discover that the purchase of certain items (say tomatoes and lettuce) in a
supermarket transaction usually implies that another set of items (salad dressing) is also
bought in that same transaction.
The interest in distributed ARM algorithms stems from two main motivations: One is the
size of databases that have to be processed and the desire to parallelize the disk I/O required
to access them. The other is the ever-growing number of applications in which databases are
already distributed when they are generated and cannot be collected for reasons such as the
lack of suf_cient centralized storage, privacy concerns, or sheer performance considerations.
In this research we focus on the emerging world of peer-to-peer systems and grid computation.
Peer-to-peer systems today contain millions of connected computers. Grid systems, likewise,
already contain tens of thousands of computers each. Both these technologies have been used
for the decentralized manifestation of a huge data storage systems. Considering the reputation
of data mining as the killer application of databases, it is only likely that they would be data
mined had there been algorithms capable of addressing such systems. For this purpose we have
developed a series of data mining algorithms that are extremely scalabile and can withstand
partial failures.