---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: translation
library_name: fairseq
tags:
- code
- decompiler
- symbolic-math
- neural-decompiler
---

# REMEND: Neural Decompilation for Reverse Engineering Math Equations from Binary Executables

This repo contains the synthetic and real-world datasets, along with the trained models, for the REMEND paper.

## Abstract

Analysis of binary executables implementing mathematical equations can benefit from the reverse engineering of semantic information about the implementation. Traditional algorithmic reverse engineering tools either do not recover semantic information or rely on dynamic analysis and symbolic execution with high reverse engineering time. Algorithmic tools also require significant re-engineering effort to target new platforms and languages. Recently, neural methods for decompilation have been developed to recover human-like source code, but they do not extract semantic information explicitly. We develop REMEND, a neural decompilation framework to reverse engineer math equations from binaries and explicitly recover program semantics such as data flow and order of operations. REMEND combines a transformer encoder-decoder model for neural decompilation with algorithmic processing for the enhanced symbolic reasoning necessary for math equations. REMEND is the first work to demonstrate that transformers for neural decompilation can go beyond source code and reason about program semantics in the form of math equations. We train on a synthetically generated dataset containing multiple implementations and compilations of math equations to produce a robust neural decompilation model and demonstrate retargetability. REMEND obtains an accuracy of 89.8% to 92.4% across three Instruction Set Architectures (ISAs), three optimization levels, and two programming languages with a single trained model, extending the capability of state-of-the-art neural decompilers. We achieve high accuracy with a small model of up to 12 million parameters and an average execution time of 0.132 seconds per function. On a real-world dataset collected from open-source programs, REMEND generalizes better than state-of-the-art neural decompilers despite being trained with synthetic data, achieving 8% higher accuracy.

## Using the synthetic dataset

The compiled equations and assembly are present in `dataset.zip`. The tokenized assembly is present in `tokenized.zip`; the tokenized files are also preprocessed with `fairseq-preprocess` (a sketch of such an invocation is shown below). Extract only `tokenized.zip` if you just want to use the synthetic data to train new models. Extract `dataset.zip` if you want to tokenize in a different way or modify the data before preprocessing.
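For reference, the preprocessing step roughly corresponds to an invocation like the one below. This is a hedged sketch rather than the exact command used to build `tokenized.zip`: the `asm`/`eqn` language suffixes, the `train`/`valid`/`test` split prefixes, and the `data-bin` output directory are assumptions.

```
# Sketch only: binarize tokenized splits for fairseq.
# Assumes files named train.asm/train.eqn, valid.asm/valid.eqn, test.asm/test.eqn.
fairseq-preprocess --source-lang asm --target-lang eqn \
    --trainpref train --validpref valid --testpref test \
    --destdir data-bin --workers 4
```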
## Using the real-world dataset

Follow these steps to compile and evaluate the real-world dataset:

1. Run `make.sh` to compile the source files into ELF and assembly files
2. Run `python3 collect_dataset.py` to disassemble the ELF functions for REMEND processing
3. Run `generate.sh` to run REMEND, generate equations, and evaluate them for correctness

The dataset will be present in `dataset/.[eqn,asm]`. The results will be present in `generated/base/_res_.txt`.

The folder `real_world_dataset/related_evals` contains scripts to evaluate the related works BTC, SLaDE, and Nova. Each of the related works needs to be set up before evaluation; see each script for further instructions.

## Replicate the synthetic dataset

Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`:

1. Download the FAIR dataset from the paper "Deep Learning for Symbolic Mathematics": [prim_fwd.tar.gz](https://dl.fbaipublicfiles.com/SymbolicMathematics/data/prim_fwd.tar.gz) (this is a large file)
2. Extract the downloaded data: `tar xvf prim_fwd.tar.gz`
3. Compile the equations and disassemble them via the REMEND disassembler: `./compile.sh`
4. Combine the compiled equations and assembly files, remove duplicates, and split: `./combine.sh`
5. Tokenize the split assembly files: `./tokenize.sh`

## Training command

The [fairseq](https://github.com/facebookresearch/fairseq) library is used for training the Transformer. The following command, with different parameters, is used to train each model (angle-bracketed values are placeholders):

```
fairseq-train <data_dir> --task translation --arch transformer \
    --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
    --max-source-positions 1024 --max-target-positions 1024 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
    --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
    --max-tokens 20000 --max-update 100000 \
    --no-epoch-checkpoints --keep-best-checkpoints 3 \
    --save-dir <save_dir> --log-file <save_dir>/training.log
```

The following command runs a trained model and generates translations:

```
fairseq-generate <data_dir> --task translation --arch transformer \
    --max-source-positions 1024 --max-target-positions 1024 \
    --path <checkpoint_path> --results-path <results_dir> --gen-subset <subset> --beam 1
```

## Ablations

The `ablations` folder contains the data for the three ablations presented in the paper: REMEND without constant identification, REMEND with equations in postfix, and REMEND trained with REMaQE data included.
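For quick manual checks of a trained checkpoint, fairseq's interactive mode can translate tokenized assembly read from standard input instead of a binarized test split. This is a minimal sketch, not part of the REMEND scripts; `<data_dir>` and `<checkpoint_path>` are placeholders for the preprocessed data directory and trained model from the sections above.

```
# Sketch only: type tokenized assembly lines on stdin, get predicted equations.
fairseq-interactive <data_dir> --task translation \
    --path <checkpoint_path> \
    --max-source-positions 1024 --max-target-positions 1024 \
    --beam 1
```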