Commit 7e94263 · Parent: d99a73a

Update README to describe realworld set and training command

Files changed:
- README.md (+44 -0)
- real_world_dataset/generate.sh (+2 -2)
- real_world_dataset/preprocess.sh (+0 -21, deleted)
README.md CHANGED

@@ -29,6 +29,20 @@ The tokenized files are also preprocessed using `fairseq-preprocess`.
 Extract only `tokenized.zip` if you just want to use the synthetic data to train new models.
 Extract `dataset.zip` if you want to tokenize in a different way or modify the data before processing.
 
+## Using the real-world dataset
+
+Follow these steps to compile and evaluate the real-world dataset:
+
+1. Run `make.sh` to compile the source files into ELF and assembly files.
+2. Run `python3 collect_dataset.py` to disassemble the ELF functions for REMEND processing.
+3. Run `generate.sh` to run REMEND, generate equations, and evaluate them for correctness.
+
+The dataset will be present in `dataset/<arch>.[eqn,asm]`.
+The results will be present in `generated/base/<arch>_res_<beamsize>.txt`.
+
+The folder `real_world_dataset/related_evals` contains scripts to evaluate the related works BTC, SLaDE, and Nova.
+Each of the related works needs to be set up before evaluation; see each script for further instructions.
+
 ## Replicate the synthetic dataset
 
 Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
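For orientation, the three steps added above chain together as sketched below. This is a minimal, hypothetical run assuming the scripts are executed from inside `real_world_dataset/`; only the script names and output paths come from the README text.

```
# Hypothetical end-to-end run of the real-world pipeline (names from the README).
cd real_world_dataset

./make.sh                   # 1. compile sources into ELF and assembly files
python3 collect_dataset.py  # 2. disassemble the ELF functions for REMEND
./generate.sh               # 3. run REMEND, generate equations, evaluate them

ls dataset/                 # expect <arch>.eqn and <arch>.asm per architecture
ls generated/base/          # expect <arch>_res_<beamsize>.txt result files
```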
@@ -38,3 +52,33 @@ Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
 3. Compile the equations and disassemble them via the REMEND disassembler: `./compile.sh`
 4. Combine the compiled equations and assembly files, remove duplicates, and split: `./combine.sh`
 5. Tokenize the split assembly files: `./tokenize.sh`
+
+## Training command
+
+The [fairseq](https://github.com/facebookresearch/fairseq) library is used to train the Transformer.
+The following command, with different parameters per model, is used to train each model.
+
+```
+fairseq-train <tokenized-dataset> --task translation --arch transformer \
+    --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
+    --max-source-positions 1024 --max-target-positions 1024 \
+    --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
+    --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
+    --max-tokens 20000 --max-update 100000 \
+    --no-epoch-checkpoints --keep-best-checkpoints 3 \
+    --save-dir <save-dir> --log-file <save-dir>/training.log
+```
+
+The following command runs a trained model and generates translations:
+
+```
+fairseq-generate <tokenized-dataset> --task translation --arch transformer \
+    --max-source-positions 1024 --max-target-positions 1024 \
+    --path <checkpoint> --results-path <out-dir> --gen-subset <train/test/valid> --beam 1
+```
+
+## Ablations
+
+The `ablations` folder contains the data for the three ablations presented in the paper: REMEND without constant identification, REMEND with equations in postfix notation, and REMEND trained with REMaQE data included.
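The hunk header above notes that the tokenized files are preprocessed with `fairseq-preprocess` before `fairseq-train` can consume them. A minimal sketch of that step follows; the `asm`/`eqn` language extensions and the `train`/`valid`/`test` split prefixes are illustrative assumptions, not confirmed by the diff.

```
# Binarize tokenized split files into a fairseq dataset directory.
# Language extensions and split prefixes here are assumed, not from the repo.
fairseq-preprocess --source-lang asm --target-lang eqn \
    --trainpref train --validpref valid --testpref test \
    --destdir tokenized-dataset
```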
real_world_dataset/generate.sh CHANGED

@@ -1,8 +1,8 @@
 #!/bin/bash
 
 ARCHS=( arm32 aarch64 x64 )
-TOKENIZERS
-MODELS
+TOKENIZERS=../tokenized
+MODELS=../models
 MODEL=base
 DS=dataset
 GEN=generated/${MODEL}
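The change above fills in the apparently placeholder `TOKENIZERS`/`MODELS` lines with repo-relative defaults, which presumably lets the script run from a fresh checkout. A usage sketch, assuming (the diff does not show this) that `tokenized/` and `models/` sit at the repository root next to `real_world_dataset/`:

```
# Assumed layout implied by TOKENIZERS=../tokenized and MODELS=../models:
#   <repo>/tokenized/            tokenizer/dataset dirs (perhaps from tokenized.zip)
#   <repo>/models/               trained model checkpoints
#   <repo>/real_world_dataset/   working directory for the script
cd real_world_dataset && ./generate.sh
```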
real_world_dataset/preprocess.sh DELETED

@@ -1,21 +0,0 @@
-#!/bin/bash
-
-ARCHS=( arm32 aarch64 x64 )
-TOKENIZERS=$HOME/projects/decode_ML/dlsym/tokenized
-MODELS=$HOME/projects/decode_ML/dlsym/ablation
-MODEL=base
-DS=dataset
-GEN=generated/${MODEL}
-
-mkdir -p ${GEN}
-
-for arch in ${ARCHS[@]}
-do
-    tok=${TOKENIZERS}/${arch}/tokenized_dlsm_${arch}
-    echo python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-    python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-    fairseq-interactive ${tok} --beam 1 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam1.txt 2>/dev/null
-    fairseq-interactive ${tok} --beam 5 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam5.txt 2>/dev/null
-    python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam1.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam1.txt
-    python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam5.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam5.txt
-done