Ph.D. Thesis


Ph.D. Student: Alon Uri
Subject: Machine Learning for Programming Language Processing
Department: Department of Computer Science
Supervisor: Prof. Eran Yahav
Full Thesis Text: English Version


Abstract

Over the past decade, the deep learning revolution, driven by artificial neural networks, has transformed a broad range of areas in computer science such as computer vision, speech recognition, and natural language processing (NLP). In parallel, the number of open-source, publicly available codebases has grown dramatically, enabling the application of neural networks to a wide range of programming-related tasks, a field that we dub Programming Language Processing (PLP).


Yet, how to represent programs in machine- and deep-learning models has remained an open question. Unlike images, programs have no straightforward tensor representation. Although a program can be represented as a sequence of tokens, like natural-language text, programs are far more structured than free text: they must comply with a rigid and rich syntax defined by a context-free grammar. Furthermore, every programming language has predefined semantics that describe what syntactically valid programs mean and do.
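
To make the contrast concrete, consider a minimal sketch (illustrative only, not taken from the thesis) that views the same one-line program both as a flat token sequence and as the syntax tree dictated by the language's grammar, using Python's standard tokenize and ast modules:

    import ast
    import io
    import tokenize

    code = "x = y + 1\n"

    # View 1: a flat sequence of tokens, as in natural-language text.
    tokens = [tok.string
              for tok in tokenize.generate_tokens(io.StringIO(code).readline)
              if tok.string.strip()]
    print(tokens)  # ['x', '=', 'y', '+', '1']

    # View 2: the Abstract Syntax Tree mandated by the context-free grammar.
    print(ast.dump(ast.parse(code)))
    # Module(body=[Assign(targets=[Name(id='x', ...)],
    #                     value=BinOp(left=Name(id='y', ...),
    #                                 op=Add(),
    #                                 right=Constant(value=1)))], ...)

The token view discards the nesting that the grammar guarantees, whereas the tree view makes it explicit.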


This dissertation addresses this general problem: how to represent programs in machine- and deep-learning models in ways that facilitate learning, capture as much information as possible, and keep the model as general as possible.

This thesis introduces the AST paths approach, which represents a program using paths in its Abstract Syntax Tree (AST). The AST paths representation allows building powerful and accurate neural models while keeping them lightweight and scalable. Specifically, this thesis shows how these models can be trained on millions of examples for tasks that include predicting properties of individual program elements, predicting properties of code snippets, generating a natural-language sequence from a given source-code snippet, and generating code completions. These models were publicly released as online interactive demos, along with open-source implementations and datasets.
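
As a rough sketch of the idea (using Python's standard ast module; the helper names below are illustrative and not taken from the thesis's actual extraction pipeline), an AST path connects two leaves of the tree through their lowest common ancestor:

    import ast
    from itertools import combinations

    def leaf_paths(tree):
        """Collect root-to-leaf node sequences; here, leaves are names and constants."""
        found = []
        def walk(node, prefix):
            prefix = prefix + [node]
            if isinstance(node, (ast.Name, ast.Constant)):
                found.append(prefix)
            for child in ast.iter_child_nodes(node):
                walk(child, prefix)
        walk(tree, [])
        return found

    def ast_path(p1, p2):
        """Join two root-to-leaf sequences at their lowest common ancestor."""
        i = 0
        while i < min(len(p1), len(p2)) and p1[i] is p2[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(p1[i - 1:])]  # leaf up to the ancestor
        down = [type(n).__name__ for n in p2[i:]]              # ancestor down to the leaf
        return " ^ ".join(up) + " _ " + " _ ".join(down)

    leaves = leaf_paths(ast.parse("x = y + 1"))
    for a, b in combinations(leaves, 2):
        print(ast_path(a, b))
    # Name ^ Assign _ BinOp _ Name
    # Name ^ Assign _ BinOp _ Constant
    # Name ^ BinOp _ Constant

In code2vec, for example, each such path, together with the tokens at its two endpoints, is embedded as a vector, and a code snippet is represented as an attention-weighted aggregate of its path vectors.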

Some of the models, such as code2vec and code2seq, are highly popular and widely used in academia and industry.



Finally, this dissertation studies theoretical differences between program representations. These studies revealed broader foundational limitations of another popular representation framework, graph neural networks (GNNs), such as a bottleneck that causes over-squashing of information flowing from distant nodes.