|M.Sc Student||Dan Trock|
|Subject||Sampling-Based Distributed Mining of Association Rules|
|Department||Department of Electrical Engineering||Supervisor||Full Professor Moses Yoram|
The economic value of data mining is today well established . Most large organizations regularly practice data mining techniques . One of the most popular techniques is association rule mining (ARM), which is the automatic discovery of pairs of element sets that tend to appear together in a common context . An example would be to discover that the purchase of certain items (say tomatoes and lettuce) in a supermarket transaction usually implies that another set of items (salad dressing) is also bought in that same transaction. Application domains for ARM include recommendation service, customer segmentation, catalog design, store layout and more . Since the introduction of the problem, efficient sequential algorithms were presented for its solution. A variety of Distributed ARM (D-ARM) algorithms were also shown in the context of parallelization and IO-speedup . Unfortunately, all the distributed algorithms presented scan the database multiple times which is still the main show-stopper for this task . In this work, we present a new algorithm that is the first D-ARM algorithm to perform a single scan over the database . As such, its performance is unmatched by any previous algorithm . The algorithm is based on the sequential Sampling algorithm, and demonstrates superlinear speedup with the number of computing nodes . Scale-up experiments over standard benchmarks demonstrate stable run time regardless of the number of computers . Theoretical analysis reveals a tighter bound on error probability than the one shown in the corresponding sequential algorithm .