Ph.D. Thesis

| Ph.D. Student | Friedman Arie |
|---|---|
| Subject | Privacy Preserving Data Mining |
| Department | Department of Computer Science |
| Supervisor | Prof. Assaf Schuster |


In recent years, privacy preserving data mining has emerged as a very active research area. This field studies how knowledge can be extracted from large data stores while maintaining commercial or legislative privacy constraints. Quite often, these constraints pertain to the individuals represented in the data. While data collectors strive to derive new insights that would allow them to improve customer service and increase their sales, consumers are concerned about the vast quantities of information collected about them and about how this information is put to use. Privacy preserving data mining aims to settle these conflicting interests. The question of how these two contrasting goals, mining new knowledge while protecting individuals' privacy, can be reconciled is the focus of this research: we seek ways to improve the tradeoff between privacy and utility when mining data.

We address the privacy/utility tradeoff problem by considering the privacy and algorithmic requirements simultaneously. We study this problem in the context of two privacy models that have attracted considerable attention in recent years, *k*-anonymity and differential privacy. Our analysis and experimental evaluations confirm that algorithmic decisions made with privacy considerations in mind may have a profound impact on the accuracy of the resulting data mining models. In chapter 1 we study this tradeoff in the context of *k*-anonymity. Since *k*-anonymity was originally defined in the context of relational tables, we begin by extending the definition of *k*-anonymity with definitions of our own, which can then be used to prove that a given data mining model is *k*-anonymous. We demonstrate how these definitions can be incorporated within a data mining algorithm to guarantee *k*-anonymous output. In chapter 2 we focus on decision tree induction and demonstrate that our technique can be used to induce decision trees that are more accurate than those obtained by anonymizing the data first and inducing the decision tree later.
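The thesis extends *k*-anonymity beyond its original table-level definition, so as background, the following is a minimal sketch of the classic test that the extension generalizes: a table is *k*-anonymous if every combination of quasi-identifier values occurs in at least *k* records. The field names and toy records here are hypothetical, chosen only for illustration, and do not come from the thesis.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Classic table-level k-anonymity test: every combination of
    quasi-identifier values must appear in at least k records."""
    groups = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical toy data; 'age_range' and 'zip_prefix' act as quasi-identifiers.
records = [
    {"age_range": "30-40", "zip_prefix": "941", "diagnosis": "flu"},
    {"age_range": "30-40", "zip_prefix": "941", "diagnosis": "cold"},
    {"age_range": "20-30", "zip_prefix": "100", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["age_range", "zip_prefix"], 2))  # False: one group has size 1
```

Anonymize-then-mine approaches coarsen the quasi-identifiers until this test passes and only then induce a model; the thesis instead enforces the guarantee inside the mining algorithm itself.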

In chapters 3 and 4 we study the privacy/accuracy tradeoff in the context of differential privacy. We show that when applying differential privacy, the method chosen by the data miner to build a classifier can make the difference between an accurate classifier and a useless one, even when the same choice would have no such effect without privacy constraints. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as a naive implementation, but with an order of magnitude fewer learning samples.