Ph.D Thesis

Ph.D StudentWolff Ran
SubjectData Mining in Large-Scale Distributed Systems
DepartmentDepartment of Computer Science
Supervisor PROF. Assaf Schuster


The economic value of data mining is today well established. One of the most popular data

mining techniques is association rule mining (ARM), which is the automatic discovery of

pairs of element sets that tend to appear together in a common context. An example would

be to discover that the purchase of certain items (say tomatoes and lettuce) in a

supermarket transaction usually implies that another set of items (salad dressing) is also

bought in that same transaction.

The interest in distributed ARM algorithms stems from two main motivations: One is the

size of databases that have to be processed and the desire to parallelize the disk I/O required

to access them. The other is the ever-growing number of applications in which databases are

already distributed when they are generated and cannot be collected for reasons such as the

lack of suf_cient centralized storage, privacy concerns, or sheer performance considerations.

In this research we focus on the emerging world of peer-to-peer systems and grid computation.

Peer-to-peer systems today contain millions of connected computers. Grid systems, likewise,

already contain tens of thousands of computers each. Both these technologies have been used

for the decentralized manifestation of a huge data storage systems.  Considering the reputation

of data mining as the killer application of databases, it is  only likely that they would be data

mined had there been algorithms capable of addressing such systems. For this purpose we have

developed a series of data mining algorithms that are extremely scalabile and can withstand

partial failures.