M.Sc Thesis

M.Sc StudentSouroujon Oren
SubjectIterative Double Clustering: An Information-Theoretic
Algorithm for Clustering Textual Data
DepartmentDepartment of Computer Science
Supervisor PROF. Ran El-Yaniv


This work proposes and studies a new unsupervised meta-clustering algorithm called Iterative Double Clustering (IDC). IDC is an extension of an algorithm called Double Clustering (DC) which clusters data points based on the mutual distribution of the points and their features, using a distance measure derived from an information theoretic principle referred to as the Information Bottleneck method. IDC performs several iterations of the DC algorithm in order to increase clustering accuracy while keeping the running time just a little longer than that of DC. IDC is most suitable for clustering data emerging from mixtures of multinomial distributions such as text documents. Using synthetically generated data, we demonstrate that the IDC algorithm is especially advantageous (over DC and other clustering algorithms) when the data exhibits high attribute noise. Further empirical results show the effectiveness of IDC in real-world text categorization problems. Using the "20 Newsgroup" data set, we surprisingly demonstrate that this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. We also show the algorithm's ability to perform authorship determination by selecting a special set of features. Furthermore, we compare the effect of embedding different basic clustering algorithms such as Agglomerative clustering, Add-C, the classic K-Means, and the recently proposed algorithm called 'Sequential IB' into the meta-clustering framework of IDC. Finally, we propose a natural extension of IDC for (semi-supervised) transductive learning where we are given both labeled and unlabeled examples, and present empirical results showing the plausibility of the extended method in a semi-supervised setting.