udiboy1209 committed
Commit 7e94263 · 1 Parent(s): d99a73a

Update README to describe realworld set and training command

README.md CHANGED
@@ -29,6 +29,20 @@ The tokenized files are also preprocessed using `fairseq-preprocess`.
Extract only `tokenized.zip` if you just want to use the synthetic data to train new models.
Extract the `dataset.zip` if you want to tokenize in a different way or want to modify the data before processing.

+ ## Using the real world dataset
+
+ Follow these steps to compile and evaluate the real-world dataset:
+
+ 1. Run `make.sh` to compile the source files into ELF and assembly files.
+ 2. Run `python3 collect_dataset.py` to disassemble the ELF functions for REMEND processing.
+ 3. Run `generate.sh` to run REMEND, generate equations, and evaluate them for correctness.
+
+ The dataset will be present in `dataset/<arch>.[eqn,asm]`.
+ The results will be present in `generated/base/<arch>_res_<beamsize>.txt`.
+
+ The folder `real_world_dataset/related_evals` contains scripts to evaluate the related works BTC, SLaDE, and Nova.
+ Each of the related works needs to be set up before evaluating. See each script for further instructions.
+
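A minimal end-to-end sketch of the three steps above, assuming they are run from inside `real_world_dataset/`; `arm32` is used as an example architecture, and the result file naming follows the `beam1`/`beam5` convention used elsewhere in the repo:

```
./make.sh                                  # 1. compile sources into ELF and assembly files
python3 collect_dataset.py                 # 2. disassemble the ELF functions for REMEND
./generate.sh                              # 3. run REMEND, generate equations, evaluate correctness
head dataset/arm32.eqn dataset/arm32.asm   # per-arch dataset files
cat generated/base/arm32_res_beam1.txt     # per-arch correctness results at beam size 1
```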
## Replicate the synthetic dataset

Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
@@ -38,3 +52,33 @@ Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
3. Compile the equations and disassemble them via the REMEND disassembler: `./compile.sh`
4. Combine the compiled equations and assembly files, remove duplicates, and split: `./combine.sh`
5. Tokenize the split assembly files: `./tokenize.sh`
+
+ ## Training command
+
+ The [fairseq](https://github.com/facebookresearch/fairseq) library is used for training the Transformer.
+ The following command with different parameters is used to train each model.
+
+ ```
+ fairseq-train <tokenized-dataset> --task translation --arch transformer \
+ --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
+ --max-source-positions 1024 --max-target-positions 1024 \
+ --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
+ --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
+ --max-tokens 20000 --max-update 100000 \
+ --no-epoch-checkpoints --keep-best-checkpoints 3 \
+ --save-dir <save-dir> --log-file <save-dir>/training.log
+ ```
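For context, `<tokenized-dataset>` is a directory produced by `fairseq-preprocess` (the README notes the shipped tokenized files were preprocessed this way). A minimal sketch of producing such a directory; the `asm`/`eqn` language extensions, the split prefixes, and the worker count are assumptions, not the exact command used for the release:

```
fairseq-preprocess --source-lang asm --target-lang eqn \
    --trainpref train --validpref valid --testpref test \
    --destdir <tokenized-dataset> --workers 4
```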
+
+ The following command runs a trained model and generates translations:
+
+ ```
+ fairseq-generate <tokenized-dataset> --task translation --arch transformer \
+ --max-source-positions 1024 --max-target-positions 1024 \
+ --path <checkpoint> --results-path <out-dir> --gen-subset <train/test/valid> --beam 1
+ ```
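With `--results-path`, `fairseq-generate` writes a `generate-<subset>.txt` file in that directory, where hypothesis lines are prefixed `H-` and are tab-separated as id, score, and tokens. A small sketch of extracting only the generated equations, assuming `--gen-subset test`:

```
grep '^H-' <out-dir>/generate-test.txt | cut -f3 > test_hypotheses.txt
```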
+
+ ## Ablations
+
+ The `ablations` folder contains the data for the three ablations presented in the paper: REMEND without constant identification, REMEND with equations in postfix, and REMEND trained with REMaQE data included.
+
real_world_dataset/generate.sh CHANGED
@@ -1,8 +1,8 @@
#!/bin/bash

ARCHS=( arm32 aarch64 x64 )
- TOKENIZERS=$HOME/projects/decode_ML/dlsym/tokenized
- MODELS=$HOME/projects/decode_ML/dlsym/ablation
+ TOKENIZERS=../tokenized
+ MODELS=../models
MODEL=base
DS=dataset
GEN=generated/${MODEL}
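The switch to relative paths implies that `generate.sh` is run from inside `real_world_dataset/`, with `tokenized/` and `models/` as siblings of that folder. A sketch of the assumed layout; the per-arch subdirectory names are taken from the old `preprocess.sh` and may differ for `generate.sh` itself:

```
# Assumed layout, relative to real_world_dataset/ (names from the old preprocess.sh):
#   ../tokenized/<arch>/tokenized_dlsm_<arch>/         fairseq-preprocessed data and asm_tokens.json
#   ../models/trained_<arch>_base/checkpoint_best.pt   trained checkpoint per architecture
cd real_world_dataset && ./generate.sh
```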
real_world_dataset/preprocess.sh DELETED
@@ -1,21 +0,0 @@
- #!/bin/bash
-
- ARCHS=( arm32 aarch64 x64 )
- TOKENIZERS=$HOME/projects/decode_ML/dlsym/tokenized
- MODELS=$HOME/projects/decode_ML/dlsym/ablation
- MODEL=base
- DS=dataset
- GEN=generated/${MODEL}
-
- mkdir -p ${GEN}
-
- for arch in ${ARCHS[@]}
- do
-     tok=${TOKENIZERS}/${arch}/tokenized_dlsm_${arch}
-     echo python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-     python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-     fairseq-interactive ${tok} --beam 1 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam1.txt 2>/dev/null
-     fairseq-interactive ${tok} --beam 5 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam5.txt 2>/dev/null
-     python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam1.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam1.txt
-     python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam5.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam5.txt
- done