---
license: mit
language:
  - en
metrics:
  - accuracy
pipeline_tag: translation
library_name: fairseq
tags:
  - code
  - decompiler
  - symbolic-math
  - neural-decompiler
---

# REMEND: Neural Decompilation for Reverse Engineering Math Equations from Binary Executables

This repo contains the synthetic and real-world datasets, along with the trained models for the REMEND paper.
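Since the checkpoints are fairseq models (see `library_name: fairseq` above), inference typically goes through `fairseq-generate` once the data has been binarized. The checkpoint filename and data directory below are placeholders for illustration, not names confirmed by this repo.

```bash
# Hedged sketch: translate tokenized assembly into equations with a
# trained REMEND checkpoint. checkpoint_best.pt and data-bin/ are
# assumed placeholder names, not paths verified against this repo.
fairseq-generate data-bin \
    --path checkpoint_best.pt \
    --task translation \
    --beam 5 \
    --batch-size 32
```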

## Abstract

Analysis of binary executables implementing mathematical equations can benefit from the reverse engineering of semantic information about the implementation. Traditional algorithmic reverse engineering tools either do not recover semantic information or rely on dynamic analysis and symbolic execution with high reverse engineering time. Algorithmic tools also require significant re-engineering effort to target new platforms and languages. Recently, neural methods for decompilation have been developed to recover human-like source code, but they do not extract semantic information explicitly. We develop REMEND, a neural decompilation framework to reverse engineer math equations from binaries and explicitly recover program semantics like data flow and order of operations. REMEND combines a transformer encoder-decoder model for neural decompilation with algorithmic processing for the enhanced symbolic reasoning necessary for math equations. REMEND is the first work to demonstrate that transformers for neural decompilation go beyond source code and reason about program semantics in the form of math equations. We train on a synthetically generated dataset containing multiple implementations and compilations of math equations to produce a robust neural decompilation model and demonstrate retargetability. REMEND obtains an accuracy of 89.8% to 92.4% across three Instruction Set Architectures (ISAs), three optimization levels, and two programming languages with a single trained model, extending the capability of state-of-the-art neural decompilers. We achieve high accuracy with a small model of up to 12 million parameters and an average execution time of 0.132 seconds per function. On a real-world dataset collected from open-source programs, REMEND generalizes better than state-of-the-art neural decompilers despite being trained on synthetic data, achieving 8% higher accuracy.

## Using the synthetic dataset

The compiled equations and assembly are present in `dataset.zip`. The tokenized assembly is in `tokenized.zip`. The tokenized files have also been preprocessed with `fairseq-preprocess`.

Extract only `tokenized.zip` if you just want to use the synthetic data to train new models. Extract `dataset.zip` if you want to tokenize the data differently or modify it before preprocessing.
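If you binarize the data yourself, a `fairseq-preprocess` invocation along these lines should work. The language suffixes, split prefixes, and output directory below are assumptions about the archive layout, not names taken from this repo; adjust them to match the extracted files.

```bash
# Hedged sketch: binarize tokenized assembly -> equation pairs for fairseq.
# The asm/eq suffixes, train/valid/test prefixes, and data-bin/ are
# assumed names; match them to the actual contents of tokenized.zip.
fairseq-preprocess \
    --source-lang asm --target-lang eq \
    --trainpref train --validpref valid --testpref test \
    --destdir data-bin \
    --workers 4
```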

## Replicating the synthetic dataset

The following steps recreate the dataset present in `dataset.zip` and `tokenized.zip`; a consolidated shell sketch follows the list.

  1. Download `prim_fwd.tar.gz`, the FAIR dataset from the paper "Deep Learning for Symbolic Mathematics" (this is a large file)
  2. Extract the downloaded data: `tar xvf prim_fwd.tar.gz`
  3. Compile the equations and disassemble them via the REMEND disassembler: `./compile.sh`
  4. Combine the compiled equations and assembly files, remove duplicates, and split: `./combine.sh`
  5. Tokenize the split assembly files: `./tokenize.sh`
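As a convenience, the steps above roughly amount to the shell session below. The download URL is the one published by the facebookresearch/SymbolicMathematics repository and is an assumption here; verify it still resolves before running.

```bash
# Hedged sketch: recreate the synthetic dataset end to end.
# The URL below is taken from the facebookresearch/SymbolicMathematics
# README and is an assumption, not a path confirmed by this repo.
wget https://dl.fbaipublicfiles.com/SymbolicMathematics/data/prim_fwd.tar.gz
tar xvf prim_fwd.tar.gz      # extract the FAIR equation dataset
./compile.sh                 # compile equations and disassemble binaries
./combine.sh                 # merge, deduplicate, and split
./tokenize.sh                # tokenize the split assembly files
```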