M.Sc Thesis

M.Sc Student: Zilberstein Meital
Subject: Code Similarity via Natural Language Descriptions
Department: Department of Computer Science
Supervisor: Prof. Eran Yahav


Code similarity is a central challenge in many programming-related applications, such as code search, automatic translation, and programming education. We present a novel approach for establishing the similarity of code fragments by computing the textual similarity between their corresponding textual descriptions. To find textual descriptions for code fragments, we leverage the collective knowledge captured in question-answering sites, blog posts, and other sources. Because our notion of code similarity is based on the similarity of the corresponding textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult for traditional approaches. To support the text-based similarity function, we also apply static analysis to the code fragments themselves and use the result as an additional similarity measure.
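The core idea, comparing code fragments through their descriptions rather than their syntax, can be illustrated with a minimal sketch. It assumes a plain bag-of-words cosine similarity, which is a stand-in chosen for brevity and not necessarily the text-similarity function used in the thesis. The function names and the example descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def description_similarity(desc_a, desc_b):
    """Cosine similarity between bag-of-words vectors of two descriptions.

    Assumption: a simple stand-in for whatever text-similarity
    function the thesis actually uses.
    """
    vec_a = Counter(tokenize(desc_a))
    vec_b = Counter(tokenize(desc_b))
    # Dot product over the tokens the two descriptions share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptions attached to code fragments in different languages.
java_desc = "How to read a text file line by line"
python_desc = "Read a text file line by line into a list"
other_desc = "Sort a dictionary by value"

related = description_similarity(java_desc, python_desc)
unrelated = description_similarity(java_desc, other_desc)
print(related > unrelated)
```

Because only the descriptions are compared, the two fragments could be written in different languages or use different libraries and would still be scored as related.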

We have implemented our approach using data obtained from the popular Q&A site Stack Overflow, and used it to determine the similarity of 100,000 pairs of code fragments. To evaluate our approach, we implemented a crowdsourcing system that allows users to label the similarity of code fragments, and used it to build a massive corpus of 6,500 labeled program pairs. Our results show that our technique is effective in determining similarity and achieves high precision, recall, and accuracy.
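As a rough illustration of how precision, recall, and accuracy could be computed against crowd-labeled pairs, the sketch below assumes each pair carries a binary "similar / not similar" label and that the similarity score has already been thresholded into a binary prediction. Both assumptions, the function name, and the toy data are hypothetical and are not taken from the thesis.

```python
def evaluate(predictions, labels):
    """Return (precision, recall, accuracy) for binary similarity decisions.

    predictions: list of bool, True if the technique judged the pair similar.
    labels:      list of bool, True if the crowd labeled the pair similar.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)          # true positives
    fp = sum(1 for p, l in pairs if p and not l)      # false positives
    fn = sum(1 for p, l in pairs if not p and l)      # false negatives
    tn = sum(1 for p, l in pairs if not p and not l)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy data for five code-fragment pairs (hypothetical, not thesis results).
predicted = [True, True, True, False, False]
crowd = [True, True, False, False, True]

precision, recall, accuracy = evaluate(predicted, crowd)
print(precision, recall, accuracy)
```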