Introducing N-Bref: a neural-based decompiler framework

January 27, 2021

What the research is:

Decompilers convert low-level executable code, such as assembly instructions, back into a high-level programming language, such as C++, that is easier for people to read. They are useful for detecting vulnerabilities and anomalies in computer security, as well as for forensics. Decompilers can also be leveraged to find potential viruses, debug programs, translate obsolete code, recover lost source code, and more. Traditionally, a decompiler is designed by hand using heuristics from human experts: for every pair of languages (e.g., C++ and assembly), a domain expert writes down a large number of rules, a time-consuming process that can take years and still requires careful manual handling of tricky cases. An upgrade of the source language, for example from C++03 to C++11, also entails non-trivial maintenance work.
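
To make the maintenance burden concrete, here is a purely hypothetical Python toy showing what a tiny hand-written rule table for lifting assembly back into C-like statements might look like. The rule table, operand format, and `lift` helper are all illustrative assumptions, not part of any real decompiler; a production tool needs thousands of such rules plus heuristics for types and control flow.

```python
# Toy illustration of rule-based "lifting": each assembly mnemonic is
# mapped to a C-like statement template by a hand-written rule.
RULES = {
    "mov": "{dst} = {src};",
    "add": "{dst} = {dst} + {src};",
    "imul": "{dst} = {dst} * {src};",
}

def lift(assembly_lines):
    """Translate a list of 'mnemonic dst, src' strings to C-like statements."""
    statements = []
    for line in assembly_lines:
        mnemonic, operands = line.split(maxsplit=1)
        dst, src = [op.strip() for op in operands.split(",")]
        template = RULES[mnemonic]  # KeyError here means "no rule written yet"
        statements.append(template.format(dst=dst, src=src))
    return statements

print(lift(["mov eax, edi", "add eax, esi"]))
```

Every new compiler idiom or language revision means writing and maintaining more rules by hand, which is exactly the burden a learned decompiler aims to remove.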

In this blog post, we propose N-Bref, a neural-based decompiler framework that improves on the accuracy of both previous neural-based systems and traditional decompilers. This work is a collaboration between FAIR and the UCSD STABLE Lab led by Jishen Zhao.

N-Bref automates the design flow from data set generation to neural network training and evaluation, with no human engineering required. It is the first framework to repurpose state-of-the-art neural networks, such as the Transformers used in neural machine translation, to handle the highly structured input and output data of realistic code decompilation tasks. N-Bref works on assembly code compiled from generated C++ programs that routinely call standard libraries (e.g., <string.h>, <math.h>), as well as on simple programs that resemble real-world code.
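
The data-generation idea can be sketched with a toy Python program generator. This is an illustrative assumption about the approach, not N-Bref's actual tool (which produces far richer programs, including standard-library calls); it merely shows how small, compilable C functions can be synthesized at scale for training.

```python
import random

OPS = ["+", "-", "*"]

def random_expr(variables, depth=2):
    """Build a random arithmetic expression over the given variables."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(variables + [str(random.randint(0, 9))])
    left = random_expr(variables, depth - 1)
    right = random_expr(variables, depth - 1)
    return f"({left} {random.choice(OPS)} {right})"

def random_function(name="f", n_vars=3):
    """Emit the source of a small C function with random straight-line code."""
    variables = [f"v{i}" for i in range(n_vars)]
    body = [f"    int {v} = {random.randint(0, 9)};" for v in variables]
    for _ in range(3):
        target = random.choice(variables)
        body.append(f"    {target} = {random_expr(variables)};")
    body.append(f"    return {random.choice(variables)};")
    return f"int {name}(void) {{\n" + "\n".join(body) + "\n}"

print(random_function())
```

Compiling each generated function yields a (C source, assembly) pair, giving the neural model an effectively unlimited supply of aligned training examples.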

This table compares the accuracy of N-Bref with previous methods using two metrics: (a) type recovery and (b) AST generation. Ins2AST is our previous neural-based program decompiler, which does not use Transformers. REWARD is an expert-designed tool for type recovery. Lang2logic is a sequence-to-tree translator. We also provide a baseline using a vanilla Transformer.

N-Bref outperforms traditional decompilers (e.g., REWARD [2]), especially when the input program is long and has sophisticated control flow. It also outperforms our previous work, which does not use Transformers. Notably, our system can decompile real-world C code that uses the standard C library (e.g., <math.h>, <string.h>), as well as basic codebases written by humans to solve real problems. Our research presents a comprehensive analysis of how each component of a neural-based decompiler design influences the overall accuracy of program recovery across different data set configurations.

How it works:

We begin by encoding the input assembly code as a graph to better represent the relationships between distinct instructions. We then embed this graph using an existing graph embedding method (GraphSage [1]) to obtain representations of the assembly code. To build and iteratively refine the abstract syntax tree (AST) that encodes the high-level semantics of the code, we use memory-augmented Transformers that can handle highly structured assembly code. Finally, we convert the AST into an actual high-level language, such as C. To collect training data, we also provide a tool that generates and unifies the representation of high-level programming languages for neural decompiler research.
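
The first step above, turning assembly into a graph, can be sketched as follows. This toy version is an assumption for illustration (the instruction format and `build_dependency_graph` helper are hypothetical, and N-Bref's actual graph construction is richer): it draws an edge from each instruction to any later instruction that reads a register it wrote.

```python
def build_dependency_graph(instructions):
    """Toy def-use graph over 'mnemonic dst, src' instructions: an edge
    (i, j) means instruction j reads a register most recently written by
    instruction i. Illustrative only, not N-Bref's actual encoding."""
    last_writer = {}  # register name -> index of the instruction that wrote it
    edges = []
    for j, line in enumerate(instructions):
        mnemonic, operands = line.split(maxsplit=1)
        dst, *rest = [op.strip() for op in operands.split(",")]
        # 'mov' overwrites dst; arithmetic ops both read and write dst.
        reads = rest if mnemonic == "mov" else [dst] + rest
        for reg in reads:
            if reg in last_writer:
                edges.append((last_writer[reg], j))
        last_writer[dst] = j  # every mnemonic here writes its first operand
    return edges

print(build_dependency_graph(["mov eax, edi", "add eax, esi", "mov ebx, eax"]))
```

Edges like these are what a graph embedding such as GraphSage would consume to produce per-instruction representations, which the Transformer decoder then uses while expanding the AST node by node.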

Why it matters:

To our knowledge, this is the first time an end-to-end trainable decompiler system has performed reasonably well on extensively used programming languages such as C++. This advancement moves the field one step closer to a practical decompiler that can operate on large-scale codebases. Our team also developed the first data set generation tool for neural-based decompiler development and testing; it generates code close to what human programmers write, and is also suitable for developing other learning-based methodologies.

The code, as well as the data generation tools, has been open-sourced on GitHub.

Written By

Yuandong Tian

Research Scientist

Cheng Fu

PhD Student