Ph.D Thesis

Ph.D StudentKatz Omer
SubjectStatistical Approaches to Reverse Engineering
DepartmentDepartment of Computer Science
Supervisor PROF. Eran Yahav
Full Thesis textFull thesis text - English Version


Software affects all aspects of our lives. More and more industries and products incorporate software as a core component, and we routinely place our trust in software. If that software fails, either due to an attack or an error, it can have major implications for users, industries, and society. As the impact of software grows rapidly, it is imperative that we raise the level of confidence in its correctness.

One way to make sure that software is free from vulnerabilities and back-doors is to understand how it works, for example by inspecting its source code. However, the vast majority of software is delivered in binary form, without source code. When source code is not available, reverse engineering is required to lift software in low-level representation (typically machine code) to higher-level presentation that can be understood by humans. Reverse engineering is notoriously hard and requires extensive manual effort by experts.

This thesis focuses on techniques that assist reverse engineering. We tackle several reverse engineering challenges and provide statistical solutions. Some previous advances on these problems have been made but a solution applicable to realistic software remained unachievable, often due to computational complexity. The main idea behind our statistical solutions is to trade off precision for computational feasibility. These solutions are effective in focusing and directing reverse engineering efforts, and can be directly used for improving the level of software security.

We first discuss the problem of identifying targets of virtual function calls in optimized stripped binaries. This is a crucial step towards understanding the control flow of the binary. It is crucial both for manual reverse engineering and for automatic analysis. Our solution relies on object tracelets --- sequences of operations applied to an object --- which represent how an instance of a type is used in practice, and the subtle differences between usages of different types.

We then tackle the problem of reconstructing type hierarchies in optimized stripped binaries. We measure a similarity between types based on similarity of usage patterns, which we elevate to a full hierarchy. Such a hierarchy is needed for understanding a given binary. Furthermore, it is essential for augmenting existing binaries with security mechanisms such as control flow integrity.

Finally, we examine the challenge of automatic decompilation. This challenge is the pinnacle of reverse engineering. Automatically decompiling low-level code back to human-readable equivalent code makes it accessible to a wider range of programmers. It also allows modification and recompilation of the code as needed. We present a new approach and a new point-of-view for this problem by framing it as a language translation problem.

Throughout this thesis, we borrow insights and methods from machine learning and natural language processing. We combine those with programming languages techniques, resulting in solutions much more powerful than achievable by each field separately.

The approaches specified in this thesis were implemented as prototypes and evaluated using real-world binaries and code. Our evaluation shows the benefits of each approach compared to traditional solutions, and its effectiveness in understanding and securing software.