M.Sc Thesis

M.Sc StudentNachmias Elad
SubjectHierarchical Code Representations
DepartmentDepartment of Computer Science
Supervisor PROF. Eran Yahav
Full Thesis textFull thesis text - English Version


In recent years, various deep-learning-based techniques have been suggested for reasoning about programs and tackling code-related prediction tasks. A common concern for these methods is the design of the code encoder, which defines (i) the representation of the input code snippet and (ii) the neural computational architecture that propagates information on top of this representation. Recent advances in the study of code representations suggest exploiting the unambiguous codes underlying structure. Indeed, structure-driven modelings became a common practice, enriching the representation with syntactic and semantic relations.

However, the standard graph-based approach relies on message-passing protocols, restricting information propagation over the entire program due to the GNNs’ over-smoothing issue. The paths-based approach is claimed to be a remedy, as it exploits the expressiveness of sequential models. Yet, even though it is restricted to syntactic relations only, it incurs unbearable computational complexity for lengthy procedures and thus becomes unscalable for real-life scenarios.

We address the problem of predicting which variables to log at a given logging insertion point within the input program. This task arises challenges like following complicated sequences of latent semantic relations, which are sparsely scattered over lengthy procedures. We show that the present solutions struggle with this setting.

We present a novel code representation approach designed to model sequences of relations. It involves semantic and syntactic characteristics modeled by paths in the procedures control-flow graph (CFG) and paths in the abstract syntax tree (AST), respectively. Our work is the first to leverage the powerful paths-based modeling approach for semantic and syntactic relations together.

Our methodology relies on two novelties. The first is a hierarchic encoding technique for code, which decomposes the entire code structure into local and global levels. Encoding each structural level individually (i) brings the procedures textually-distant entities closer, (ii) supports lengthy procedures, and (iii) models compositions of local relations. The second novelty is a graph encoding framework called expand-collapse, which extends the paths-based encoder by allowing information to flow across paths.

We compare our approach to a wide variety of other known representations. Our experiments show that our model is more effective than others by up to 10% (accuracy). We show how the lack of scalability in other rich representations causes them to fail, even when compared to the degenerated non-structured approach. Our method enhances them by reducing 95% of the processed tokens without any loss of information. We provide a thorough ablation study to justify the benefit of different components of our solution.