udiboy1209 committed
Commit 7e94263 · 1 Parent(s): d99a73a

Update README to describe realworld set and training command

README.md CHANGED
@@ -29,6 +29,20 @@ The tokenized files are also preprocessed using `fairseq-preprocess`.
Extract only `tokenized.zip` if you just want to use the synthetic data to train new models.
Extract the `dataset.zip` if you want to tokenize in a different way or want to modify the data before processing.

+ ## Using the real world dataset
+
+ Follow these steps to compile and evaluate the real-world dataset:
+
+ 1. Run `make.sh` to compile the source files into ELF and assembly files.
+ 2. Run `python3 collect_dataset.py` to disassemble the ELF functions for REMEND processing.
+ 3. Run `generate.sh` to run REMEND, generate equations, and evaluate them for correctness.
+
+ The dataset will be present in `dataset/<arch>.[eqn,asm]`.
+ The results will be present in `generated/base/<arch>_res_<beamsize>.txt`.
+
+ The folder `real_world_dataset/related_evals` contains scripts to evaluate the related works BTC, SLaDE, and Nova.
+ Each of the related works needs to be set up before evaluating. See each script for further instructions.
+
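A minimal end-to-end sketch of the three steps above, assuming they are run from inside `real_world_dataset/`; `arm32` is used as an example architecture, and the result file naming follows the `beam1`/`beam5` convention used elsewhere in the repo:

```
./make.sh                                  # 1. compile sources into ELF and assembly files
python3 collect_dataset.py                 # 2. disassemble the ELF functions for REMEND
./generate.sh                              # 3. run REMEND, generate equations, evaluate correctness
head dataset/arm32.eqn dataset/arm32.asm   # per-arch dataset files
cat generated/base/arm32_res_beam1.txt     # per-arch correctness results at beam size 1
```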
## Replicate the synthetic dataset

Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
@@ -38,3 +52,33 @@ Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
3. Compile the equations and disassemble them via the REMEND disassembler: `./compile.sh`
4. Combine the compiled equations and assembly files, remove duplicates, and split: `./combine.sh`
5. Tokenize the split assembly files: `./tokenize.sh`
+
+ ## Training command
+
+ The [fairseq](https://github.com/facebookresearch/fairseq) library is used for training the Transformer.
+ The following command with different parameters is used to train each model.
+
+ ```
+ fairseq-train <tokenized-dataset> --task translation --arch transformer \
+ --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
+ --max-source-positions 1024 --max-target-positions 1024 \
+ --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
+ --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
+ --max-tokens 20000 --max-update 100000 \
+ --no-epoch-checkpoints --keep-best-checkpoints 3 \
+ --save-dir <save-dir> --log-file <save-dir>/training.log
+ ```
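For context, `<tokenized-dataset>` is a directory produced by `fairseq-preprocess` (the README notes the shipped tokenized files were preprocessed this way). A minimal sketch of producing such a directory; the `asm`/`eqn` language extensions, the split prefixes, and the worker count are assumptions, not the exact command used for the release:

```
fairseq-preprocess --source-lang asm --target-lang eqn \
    --trainpref train --validpref valid --testpref test \
    --destdir <tokenized-dataset> --workers 4
```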
+
+ The following command runs a trained model and generates translations:
+
+ ```
+ fairseq-generate <tokenized-dataset> --task translation --arch transformer \
+ --max-source-positions 1024 --max-target-positions 1024 \
+ --path <checkpoint> --results-path <out-dir> --gen-subset <train/test/valid> --beam 1
+ ```
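With `--results-path`, `fairseq-generate` writes a `generate-<subset>.txt` file in that directory, where hypothesis lines are prefixed `H-` and are tab-separated as id, score, and tokens. A small sketch of extracting only the generated equations, assuming `--gen-subset test`:

```
grep '^H-' <out-dir>/generate-test.txt | cut -f3 > test_hypotheses.txt
```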
+
+ ## Ablations
+
+ The `ablations` folder contains the data for the three ablations presented in the paper: REMEND without constant identification, REMEND with equations in postfix, and REMEND trained with REMaQE data included.
+
real_world_dataset/generate.sh CHANGED
@@ -1,8 +1,8 @@
#!/bin/bash

ARCHS=( arm32 aarch64 x64 )
- TOKENIZERS=$HOME/projects/decode_ML/dlsym/tokenized
- MODELS=$HOME/projects/decode_ML/dlsym/ablation
+ TOKENIZERS=../tokenized
+ MODELS=../models
MODEL=base
DS=dataset
GEN=generated/${MODEL}
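The switch to relative paths implies that `generate.sh` is run from inside `real_world_dataset/`, with `tokenized/` and `models/` as siblings of that folder. A sketch of the assumed layout; the per-arch subdirectory names are taken from the old `preprocess.sh` and may differ for `generate.sh` itself:

```
# Assumed layout, relative to real_world_dataset/ (names from the old preprocess.sh):
#   ../tokenized/<arch>/tokenized_dlsm_<arch>/         fairseq-preprocessed data and asm_tokens.json
#   ../models/trained_<arch>_base/checkpoint_best.pt   trained checkpoint per architecture
cd real_world_dataset && ./generate.sh
```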
real_world_dataset/preprocess.sh DELETED
@@ -1,21 +0,0 @@
- #!/bin/bash
-
- ARCHS=( arm32 aarch64 x64 )
- TOKENIZERS=$HOME/projects/decode_ML/dlsym/tokenized
- MODELS=$HOME/projects/decode_ML/dlsym/ablation
- MODEL=base
- DS=dataset
- GEN=generated/${MODEL}
-
- mkdir -p ${GEN}
-
- for arch in ${ARCHS[@]}
- do
-     tok=${TOKENIZERS}/${arch}/tokenized_dlsm_${arch}
-     echo python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-     python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-     fairseq-interactive ${tok} --beam 1 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam1.txt 2>/dev/null
-     fairseq-interactive ${tok} --beam 5 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam5.txt 2>/dev/null
-     python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam1.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam1.txt
-     python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam5.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam5.txt
- done