Ph.D Thesis | |

Ph.D Student | Sebban Partush Nimrod |
---|---|

Subject | Differential Program Analysis |

Department | Department of Computer Science |

Supervisor | PROF. Eran Yahav |

Full Thesis text |

The evolution of software is an emerging research topic, receiving much attention and focus, as every line of code written today is more likely to belong to an existing piece of software. Identifying and modeling the semantic impact of these patches in a sound and precise way is a paramount goal yet to be fulfilled. This work is an effort to generate sound, precise and scalable analysis methods for modeling the differences and measuring the similarity between programs.

This work first focuses on computing
semantic differences in numerical programs where the values of variables have
no a-priori bounds, and uses abstract interpretation to compute an
over-approximation of program differences. Computing differences and
establishing equivalence under abstraction requires abstracting relationships
between variables in the two programs. Towards that end, a *correlating
program* is first constructed, in which these relationships can be tracked,
and then a *correlating abstract domain* is used to compute a sound
approximation of these relationships. The construction of the correlating
program relies on *syntactic similarity*. To better establish equivalence
between correlated variables and precisely capture differences, the abstract
domain has to represent non-convex information using a partially-disjunctive
abstract domain. To balance precision and cost of this representation, the
domain over-approximates numerical information while preserving equivalence
between correlated variables by dynamically partitioning the disjunctive state
according to equivalence criteria.

This approximation of difference can
be computed over any interleaving of the two programs. However, the choice of
interleaving can significantly affect precision. As the construction of the
correlating programs limited the approach to syntactically similar programs,
this work further presents a *speculative search algorithm* that aims to
find an interleaving of the two programs with minimal abstract semantic
difference. This method is unique as it allows the analysis to dynamically
alternate between several interleavings.

Next, to allow a scalable computation
of difference for whole programs, change impact analysis (CIA) is employed. CIA
refers to the task of determining the set of program statements affected by a
program change. The approach shows how to leverage differential invariants to
make dataflow based CIA *change-semantics aware*, thereby improving
precision in the presence of semantics preserving changes. A novel anytime
algorithm allows costly differential invariant generation to be applied
incrementally to refine the set of impacted statements.

Finally, the problem of finding similar procedures in stripped binaries is addressed. A new statistical approach is presented for measuring similarity between two procedures. The notion of similarity makes it possible to find similar code even when it has been compiled using different compilers, or has been slightly modified.

The main idea is to use *similarity
by composition*: decompose the code into smaller comparable fragments,
define *semantic* similarity between fragments, and use statistical
reasoning to lift fragment similarity into similarity between procedures.

All of the approaches specified in this work are evaluated using real-world code projects. The evaluation shows the benefit of each approach and its ability to produce a sound and precise description of difference and similarity.