
AceReason Evaluation Toolkit

Our evaluation scripts and code are available at https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/evaluation.tar.gz

Environment

  • vllm==0.7.3
  • torch==2.5.1
  • transformers==4.48.2
  • 8x NVIDIA H100 80GB HBM3 (CUDA Version: 12.8)
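As a quick sanity check before running anything, the pinned versions above can be compared against the active Python environment. The following is a minimal sketch, not part of the toolkit; the helper name `check_environment` is hypothetical, and it ignores local build tags such as `+cu124` when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the environment list above.
EXPECTED = {"vllm": "0.7.3", "torch": "2.5.1", "transformers": "4.48.2"}


def check_environment(expected=EXPECTED, get_version=version):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu124" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in check_environment().items():
        print(f"{package}: expected {wanted}, found {installed}")
```

An empty result means all three packages match the pins; the GPU and CUDA setup is not checked here.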

Dataset Download

LiveCodeBench:

from datasets import load_dataset

ds = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v6",
)["test"]

ds.to_json("data/livecodebench_problems.json", orient="records", lines=False)

Math: see the files under data/ (for example, data/aime24.jsonl and data/aime25.jsonl).

Evaluation Script

To generate model outputs with a single seed, use the following commands:

bash generate_livecodebench.sh ${model_path} ${seed} ${output_path} ${model_type}
bash generate_aime.sh ${model_path} ${seed} aime24 ${output_path} ${model_type}
bash generate_aime.sh ${model_path} ${seed} aime25 ${output_path} ${model_type}

Please specify model_type as r1 for AceReason-Nemotron-1.0 models, and qwen for AceReason-Nemotron-1.1 models.
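To run the single-seed commands for several seeds of your own choosing, the argument order shown above can be assembled programmatically. This is a sketch under assumptions: `build_generation_commands` is a hypothetical helper, the model path, output directory, and seed values are placeholders, and the seeds used by the provided `run_*.sh` scripts are not reproduced here.

```python
import shlex


def build_generation_commands(model_path, output_path, model_type, seeds):
    """Return one argument list per script invocation for each seed."""
    if model_type not in {"r1", "qwen"}:
        raise ValueError("model_type must be 'r1' or 'qwen'")
    commands = []
    for seed in seeds:
        seed = str(seed)
        commands.append(
            ["bash", "generate_livecodebench.sh", model_path, seed, output_path, model_type]
        )
        for dataset in ("aime24", "aime25"):
            commands.append(
                ["bash", "generate_aime.sh", model_path, seed, dataset, output_path, model_type]
            )
    return commands


if __name__ == "__main__":
    # Placeholder values: the path, output directory, and seeds are illustrative only.
    for cmd in build_generation_commands("path/to/model", "outputs/run1", "r1", seeds=[0, 1]):
        # Print the shell-quoted command; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building argument lists instead of shell strings avoids quoting problems when paths contain spaces.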

Alternatively, use our preconfigured seeds to reproduce our results on AIME 24/25 (avg@64) and LiveCodeBench v5/v6 (avg@8):

bash run_livecodebench.sh ${model_path} ${output_path}
bash run_aime.sh ${model_path} ${output_path}

To score the generations and reproduce our benchmark results, run the following evaluation commands:

python evaluate_livecodebench.py -g ${output_path}
python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime24.jsonl
python evaluate_aime.py --modelfolder ${output_path} --test_data data/aime25.jsonl
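For readers unfamiliar with the avg@k metric reported here, the sketch below shows one common way to compute it: pool every sampled generation and take the fraction that is correct. It is an illustration only, not the logic of `evaluate_livecodebench.py` or `evaluate_aime.py`; the function name and input layout are assumptions.

```python
def avg_at_k(correct_by_problem):
    """Mean accuracy (in percent) over all sampled generations.

    correct_by_problem maps a problem id to a list of booleans,
    one entry per sampled generation (seed) for that problem.
    """
    total = sum(len(flags) for flags in correct_by_problem.values())
    if total == 0:
        raise ValueError("no generations to score")
    correct = sum(sum(flags) for flags in correct_by_problem.values())
    return 100.0 * correct / total


if __name__ == "__main__":
    # Two toy problems with four samples each: 3 + 1 correct out of 8.
    toy = {
        "p1": [True, True, False, True],
        "p2": [False, False, True, False],
    }
    print(avg_at_k(toy))  # 50.0
```

When every problem has the same number of samples, this pooled mean equals the mean of the per-problem pass rates.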

Reference Results

We also provide our generations in cache.tar.gz for reference.

LiveCodeBench AceReason-Nemotron-1.0-7B (Avg@8)
=================================================================
Months          Corrects        Total           Accuracy
2023-05         180             272             66.17647058823529
2023-06         238             312             76.28205128205128
2023-07         337             432             78.00925925925925
2023-08         185             288             64.23611111111111
2023-09         275             352             78.125
2023-10         257             352             73.01136363636364
2023-11         217             280             77.5
2023-12         228             320             71.25
2024-01         193             288             67.01388888888889
2024-02         169             256             66.015625
2024-03         234             360             65.0
2024-04         226             296             76.35135135135135
2024-05         211             288             73.26388888888889
05/23-05/24     2950            4096            72.021484375
2024-06         277             368             75.27173913043478
2024-07         223             344             64.82558139534883
2024-08         275             528             52.083333333333336
2024-09         204             376             54.255319148936174
2024-10         209             424             49.29245283018868
2024-11         216             456             47.36842105263158
2024-12         223             392             56.88775510204081
2025-01         161             408             39.46078431372549
06/24-01/25     1788            3296            54.24757281553398
2025-02         179             408             43.872549019607845
2025-03         258             544             47.4264705882353
2025-04         38              96              39.583333333333336
v5              1142            2232            51.16487455197132
v6              621             1400            44.357142857142854

LiveCodeBench AceReason-Nemotron-1.0-14B (Avg@8)
=================================================================
Months          Corrects        Total           Accuracy
2023-05         211             272             77.57352941176471
2023-06         282             312             90.38461538461539
2023-07         393             432             90.97222222222223
2023-08         219             288             76.04166666666667
2023-09         315             352             89.48863636363636
2023-10         294             352             83.52272727272727
2023-11         229             280             81.78571428571429
2023-12         263             320             82.1875
2024-01         219             288             76.04166666666667
2024-02         201             256             78.515625
2024-03         296             360             82.22222222222223
2024-04         252             296             85.13513513513513
2024-05         233             288             80.90277777777777
05/23-05/24     3407            4096            83.1787109375
2024-06         311             368             84.51086956521739
2024-07         248             344             72.09302325581395
2024-08         299             528             56.628787878787875
2024-09         232             376             61.702127659574465
2024-10         266             424             62.735849056603776
2024-11         282             456             61.8421052631579
2024-12         253             392             64.54081632653062
2025-01         217             408             53.18627450980392
06/24-01/25     2108            3296            63.95631067961165
2025-02         211             408             51.71568627450981
2025-03         324             544             59.55882352941177
2025-04         41              96              42.708333333333336
v5              1350            2232            60.483870967741936
v6              775             1400            55.357142857142854

LiveCodeBench AceReason-Nemotron-1.1-7B (Avg@8)
=================================================================
Months          Corrects        Total           Accuracy
2023-05         205             272             75.36764705882354
2023-06         255             312             81.73076923076923
2023-07         356             432             82.4074074074074
2023-08         208             288             72.22222222222223
2023-09         287             352             81.5340909090909
2023-10         278             352             78.97727272727273
2023-11         234             280             83.57142857142857
2023-12         263             320             82.1875
2024-01         215             288             74.65277777777777
2024-02         182             256             71.09375
2024-03         270             360             75.0
2024-04         254             296             85.8108108108108
2024-05         221             288             76.73611111111111
05/23-05/24     3228            4096            78.80859375
2024-06         309             368             83.96739130434783
2024-07         235             344             68.31395348837209
2024-08         292             528             55.303030303030305
2024-09         211             376             56.11702127659574
2024-10         254             424             59.905660377358494
2024-11         269             456             58.99122807017544
2024-12         239             392             60.96938775510204
2025-01         194             408             47.549019607843135
06/24-01/25     2003            3296            60.77063106796116
2025-02         203             408             49.754901960784316
2025-03         306             544             56.25
2025-04         41              96              42.708333333333336
v5              1283            2232            57.482078853046595
v6              726             1400            51.857142857142854
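The columns in the tables above are related by Accuracy = 100 × Corrects / Total, and the date-range rows are the pooled sums of their months. The sketch below checks both relations on rows copied from the first table (the 1.0-7B model); `parse_rows` is a hypothetical helper, not part of the toolkit.

```python
def parse_rows(table_text):
    """Parse 'label corrects total accuracy' rows into a dict keyed by label."""
    rows = {}
    for line in table_text.strip().splitlines():
        label, corrects, total, accuracy = line.split()
        rows[label] = (int(corrects), int(total), float(accuracy))
    return rows


# Rows copied from the first LiveCodeBench table above.
SAMPLE = """
2024-06         277             368             75.27173913043478
2024-07         223             344             64.82558139534883
2024-08         275             528             52.083333333333336
2024-09         204             376             54.255319148936174
2024-10         209             424             49.29245283018868
2024-11         216             456             47.36842105263158
2024-12         223             392             56.88775510204081
2025-01         161             408             39.46078431372549
06/24-01/25     1788            3296            54.24757281553398
"""

rows = parse_rows(SAMPLE)

# Every row: accuracy is corrects / total, expressed as a percentage.
for label, (corrects, total, accuracy) in rows.items():
    assert abs(100.0 * corrects / total - accuracy) < 1e-9, label

# The range row pools the monthly counts rather than averaging monthly accuracies.
months = [values for label, values in rows.items() if label != "06/24-01/25"]
pooled_corrects = sum(corrects for corrects, _, _ in months)
pooled_total = sum(total for _, total, _ in months)
assert (pooled_corrects, pooled_total) == rows["06/24-01/25"][:2]
```

The same check can be applied to the other tables by pasting their rows into `SAMPLE`.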

AceReason-Nemotron-1.0-7B
====================================
AIME2024 (Avg@64) 68.64583333333334
AIME2025 (Avg@64) 53.59375000000002

AceReason-Nemotron-1.0-14B
====================================
AIME2024 (Avg@64) 78.43749999999997
AIME2025 (Avg@64) 67.65625

AceReason-Nemotron-1.1-7B
====================================
AIME2024 (Avg@64) 72.60416666666667
AIME2025 (Avg@64) 64.84375
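The AIME averages can be converted back into counts of correct generations. The sketch below assumes each AIME set contains 30 problems (AIME I and II, 15 each), so avg@64 pools 30 × 64 = 1920 generations; that problem count is an assumption and is not stated in this README. The helper name `implied_correct` is hypothetical.

```python
PROBLEMS_PER_SET = 30      # assumption: AIME I and II, 15 problems each
SAMPLES_PER_PROBLEM = 64   # from "Avg@64"


def implied_correct(avg_percent, problems=PROBLEMS_PER_SET, samples=SAMPLES_PER_PROBLEM):
    """Number of correct generations implied by a pooled avg@k percentage."""
    exact = avg_percent / 100.0 * problems * samples
    count = round(exact)
    # A large gap would suggest the assumed problem count is wrong.
    if abs(exact - count) > 1e-6:
        raise ValueError(f"{avg_percent} does not map to a whole count")
    return count


# AIME values copied from the results above, in order of appearance.
REPORTED = [
    68.64583333333334,
    53.59375000000002,
    78.43749999999997,
    67.65625,
    72.60416666666667,
    64.84375,
]

if __name__ == "__main__":
    total = PROBLEMS_PER_SET * SAMPLES_PER_PROBLEM
    for value in REPORTED:
        print(f"{value:.2f}% -> {implied_correct(value)} / {total}")
```

Every reported value maps to a whole number of correct generations out of 1920, which is consistent with the 30-problem assumption.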