January 27, 2021
Decompilers convert low-level executable code, like assembly instructions, back to a high-level programming language, like C++, that is easier for people to read. They’re useful for detecting vulnerabilities and anomalies in computer security as well as for forensics. Decompilers can also be leveraged to find potential viruses, debug programs, translate obsolete code, recover lost source code, and more. A decompiler is traditionally designed by hand, using heuristics from human experts. So, for every pair of programming languages (e.g., C++ and assembly), a domain expert has to write down a large number of rules. This is a time-consuming process that can take years and that requires extra care and manual adjustment in tricky situations. An upgrade of the source language, for example from C++03 to C++11, also leads to non-trivial maintenance work.
In this blog post, we propose N-Bref, a neural-based decompiler framework that improves on the accuracy of traditional decompilation systems. This is a joint collaboration between FAIR and UCSD STABLE Lab led by Jishen Zhao.
N-Bref automates the design flow from data set generation to neural network training and evaluation, with no human engineer required. It is the first to repurpose state-of-the-art neural networks, such as the Transformers used in neural machine translation, to handle the highly structured input and output data in realistic code decompilation tasks. N-Bref works on assembly code compiled from generated C++ programs that routinely call standard libraries (e.g., <string.h>, <math.h>), as well as on simple solutions that resemble real codebases.
N-Bref outperforms traditional decompilers (e.g., REWARD), especially when the input program is long and has sophisticated control flow. It also outperforms our previous work (which does not use Transformers). Notably, our system can decompile real-world C code from the standard C library (e.g., <math.h>, <string.h>) and from basic codebases written by humans to solve real problems. Our research presents a comprehensive analysis of how each component of a neural-based decompiler design influences the overall accuracy of program recovery across different data set configurations.
We begin by encoding the input assembly code into a graph structure to better represent the relationships between distinct instructions. Then, we embed the graph using an existing graph embedding tool (GraphSAGE) to obtain representations of the assembly code. To build and iteratively refine the abstract syntax tree (AST) that encodes the high-level semantics of the code, we use memory-augmented transformers that handle the highly structured assembly code. Finally, we convert the AST into an actual high-level language, such as C. To collect training data, we also provide a tool that generates and unifies the representation of high-level programming languages for neural decompiler research.
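As a rough illustration of the first two steps, the Python sketch below turns a toy instruction list into a graph and then runs one round of mean neighborhood aggregation, the basic idea behind GraphSAGE's mean aggregator. It is not N-Bref's implementation: the instruction tuples, the choice of edges (fall-through order plus register data dependencies), and the helper names `build_graph`, `one_hot_opcodes`, and `mean_aggregate` are all assumptions made for this example, and the aggregation leaves out the learned weights and nonlinearity that GraphSAGE applies.

```python
from collections import defaultdict

# Hypothetical toy input: (opcode, destination register, source registers).
# This tuple format is an assumption for illustration, not N-Bref's input format.
ASM = [
    ("mov", "eax", ["edi"]),
    ("mov", "ecx", ["esi"]),
    ("add", "eax", ["eax", "ecx"]),
    ("ret", None, ["eax"]),
]


def build_graph(instructions):
    """Link instructions by fall-through order and by register data flow."""
    edges = defaultdict(set)
    last_writer = {}  # register -> index of the instruction that last wrote it
    for i, (_, dst, srcs) in enumerate(instructions):
        if i > 0:
            # Undirected edge to the previous instruction (fall-through order).
            edges[i - 1].add(i)
            edges[i].add(i - 1)
        for reg in srcs:
            if reg in last_writer:
                # Undirected edge to the instruction that produced this operand.
                j = last_writer[reg]
                edges[j].add(i)
                edges[i].add(j)
        if dst is not None:
            last_writer[dst] = i
    return {i: sorted(edges[i]) for i in range(len(instructions))}


def one_hot_opcodes(instructions):
    """Use a one-hot opcode vector as each node's initial feature."""
    vocab = sorted({op for op, _, _ in instructions})
    return [[1.0 if op == v else 0.0 for v in vocab] for op, _, _ in instructions]


def mean_aggregate(graph, features):
    """Concatenate each node's features with the mean of its neighbors' features.

    A full GraphSAGE layer would follow this with a learned linear map and a
    nonlinearity; both are omitted here.
    """
    out = []
    for node, neighbors in graph.items():
        dim = len(features[node])
        if neighbors:
            mean = [
                sum(features[n][d] for n in neighbors) / len(neighbors)
                for d in range(dim)
            ]
        else:
            mean = [0.0] * dim
        out.append(features[node] + mean)
    return out


graph = build_graph(ASM)
embeddings = mean_aggregate(graph, one_hot_opcodes(ASM))
for index, vector in enumerate(embeddings):
    print(index, ASM[index][0], vector)
```

After one round, each instruction's vector already reflects which opcodes it is connected to. Stacking more rounds would spread information further across the graph, which is the kind of relationship between instructions that the graph encoding is meant to expose.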
To our knowledge, this is the first time that an end-to-end trainable code decompiler system has performed reasonably well on widely used programming languages such as C++. This advancement moves the field one step closer to a practical decompiler system that can operate on a large-scale codebase. Our team also developed the first data set generation tool for neural-based decompiler development and testing that generates code close to what human programmers write. The tool is also suitable for developing learning-based methodologies.
The code and the data generation tools have been open sourced on GitHub.