Ph.D Thesis

Ph.D StudentSebban Partush Nimrod
SubjectDifferential Program Analysis
DepartmentDepartment of Computer Science
Supervisor PROF. Eran Yahav
Full Thesis textFull thesis text - English Version


The evolution of software is an emerging research topic, receiving much attention and focus, as every line of code written today is more likely to belong to an existing piece of software. Identifying and modeling the semantic impact of these patches in a sound and precise way is a paramount goal yet to be fulfilled. This work is an effort to generate sound, precise and scalable analysis methods for modeling the differences and measuring the similarity between programs.

This work first focuses on computing semantic differences in numerical programs where the values of variables have no a-priori bounds, and uses abstract interpretation to compute an over-approximation of program differences. Computing differences and establishing equivalence under abstraction requires abstracting relationships between variables in the two programs. Towards that end, a correlating program is first constructed, in which these relationships can be tracked, and then a correlating abstract domain is used to compute a sound approximation of these relationships. The construction of the correlating program relies on syntactic similarity. To better establish equivalence between correlated variables and precisely capture differences, the abstract domain has to represent non-convex information using a partially-disjunctive abstract domain. To balance precision and cost of this representation, the domain over-approximates numerical information while preserving equivalence between correlated variables by dynamically partitioning the disjunctive state according to equivalence criteria.

This approximation of difference can be computed over any interleaving of the two programs. However, the choice of interleaving can significantly affect precision. As the construction of the correlating programs limited the approach to syntactically similar programs, this work further presents a speculative search algorithm that aims to find an interleaving of the two programs with minimal abstract semantic difference. This method is unique as it allows the analysis to dynamically alternate between several interleavings.

Next, to allow a scalable computation of difference for whole programs, change impact analysis (CIA) is employed. CIA refers to the task of determining the set of program statements affected by a program change. The approach shows how to leverage differential invariants to make dataflow based CIA change-semantics aware, thereby improving precision in the presence of semantics preserving changes. A novel anytime algorithm allows costly differential invariant generation to be applied incrementally to refine the set of impacted statements.

Finally, the problem of finding similar procedures in stripped binaries is addressed. A new statistical approach is presented for measuring similarity between two procedures. The notion of similarity makes it possible to find similar code even when it has been compiled using different compilers, or has been slightly modified.

The main idea is to use similarity by composition: decompose the code into smaller comparable fragments, define semantic similarity between fragments, and use statistical reasoning to lift fragment similarity into similarity between procedures.

All of the approaches specified in this work are evaluated using real-world code projects. The evaluation shows the benefit of each approach and its ability to produce a sound and precise description of difference and similarity.