Ph.D Thesis

Ph.D StudentArie Friedman
SubjectPrivacy Preserving Data Mining
DepartmentDepartment of Computer Science
Supervisor Full Professors Schuster Assaf
Full Thesis textFull thesis text - English Version


In recent years, privacy preserving data mining has emerged as a very active research area. This field of research studies how knowledge can be extracted from large data stores while maintaining commercial or legislative privacy constraints. Quite often, these constraints pertain to individuals represented in the data stores. While data collectors strive to derive new insights that would allow them to improve customer service and increase their sales, consumers are concerned about the vast quantities of information collected about them and how this information is put to use. Privacy preserving data mining aims to settle these conflicting interests.  The question how these two contrasting goals, mining new knowledge while protecting individuals' privacy, can be reconciled, is the focus of this research. We seek ways to improve the tradeoff between privacy and utility when mining data.

We address the privacy/utility tradeoff problem by considering the privacy and algorithmic requirements simultaneously. We study this problem in the context of two privacy models that attracted a lot of attention in recent years, k-anonymity and differential privacy. Our analysis and experimental evaluations confirm that algorithmic decisions made with privacy considerations in mind may have a profound impact on the accuracy of the resulting data mining models. In chapter 1 we study this tradeoff in the context of k-anonymity. Since k-anonymity was originally defined in the context of relational tables, we begin with extending the definition of k-anonymity with definitions of our own, which can then be used to prove that a given data mining model is k-anonymous. We demonstrate how the definitions can be incorporated within a data mining algorithm to guarantee k-anonymous output. In chapter 2 we focus on decision tree induction and demonstrate that our technique can be used to induce decision trees which are more accurate than those acquired by anonymizing the data first and inducing the decision tree later.

In chapters 3 and 4 we study the privacy/accuracy tradeoff in the context of differential privacy. We show that when applying differential privacy, the method chosen by the data miner to build a classifier could make the difference between an accurate classifier and a useless one, even when the same choice without privacy constraints would have no such effect. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation but with an order of magnitude fewer learning samples.