What is the Time Range Selected in the LCB V6 Benchmark?
The capabilities of your GLM-4.6 model are truly remarkable. We have recently been testing the model's performance across various domains, starting with an evaluation on the LiveCodeBench benchmark.
However, the scores we achieved were not as high as those stated in the technical report. Even with the PyPy interpreter, the evaluation produces the following results:
2024-08-01-2025-02-01-Pass@1: 0.759878
2024-08-01-2025-02-01-Easy Pass@1: 0.987654
2024-08-01-2025-02-01-Medium Pass@1: 0.864078
2024-08-01-2025-02-01-Hard Pass@1: 0.558621
2024-08-01-2025-05-01-Pass@1: 0.766520
2024-08-01-2025-05-01-Easy Pass@1: 0.990909
2024-08-01-2025-05-01-Medium Pass@1: 0.858156
2024-08-01-2025-05-01-Hard Pass@1: 0.581281
2025-01-01-2025-02-01-Pass@1: 0.754386
2025-01-01-2025-02-01-Easy Pass@1: 1.000000
2025-01-01-2025-02-01-Medium Pass@1: 0.882353
2025-01-01-2025-02-01-Hard Pass@1: 0.500000
2025-01-01-2025-05-01-Pass@1: 0.774725
2025-01-01-2025-05-01-Easy Pass@1: 1.000000
2025-01-01-2025-05-01-Medium Pass@1: 0.854545
2025-01-01-2025-05-01-Hard Pass@1: 0.597561
2025-02-01-2025-05-01-Pass@1: 0.793893
2025-02-01-2025-05-01-Easy Pass@1: 1.000000
2025-02-01-2025-05-01-Medium Pass@1: 0.846154
2025-02-01-2025-05-01-Hard Pass@1: 0.655738
We cannot reproduce the score (82.8) from the technical report, so we need your help.
We are aware that LiveCodeBench includes a time range selection feature, and scores can vary depending on the chosen time range. Therefore, we would like to inquire about the specific time range you used.
Alternatively, would it be possible for you to share the intermediate evaluation result files from your testing process with the open-source community?
Thank you very much for your interest in and support of the GLM-4.6 model.
Our evaluation on LiveCodeBench (LCB) was conducted using the v6 version of the dataset, which can be found here:
https://huggingface.co/datasets/livecodebench/code_generation_lite/blob/main/test6.jsonl
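Since scores depend on the chosen time window, problems can be filtered by contest date directly from that file. Below is a minimal sketch of such a filter; the `contest_date` and `difficulty` field names are assumptions based on the code_generation_lite schema and should be verified against the dataset:

```python
# Minimal sketch: select a LiveCodeBench time window from test6.jsonl.
# Field names (contest_date, difficulty) are assumed from the
# code_generation_lite schema; verify them before relying on this.
import json
from datetime import datetime

def load_window(path: str, start: str, end: str) -> list[dict]:
    """Return problems whose contest_date falls in [start, end)."""
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    selected = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            problem = json.loads(line)
            # contest_date is assumed to be an ISO-format timestamp string
            date = datetime.fromisoformat(problem["contest_date"])
            if lo <= date < hi:
                selected.append(problem)
    return selected

window = load_window("test6.jsonl", "2025-01-01", "2025-05-01")
print(f"{len(window)} problems in window")
```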
For our internal tests, we used the SGLang FP8 block-quantized model for both serving and evaluation. The sampling parameters were as follows (a serving sketch follows the list):
- Temperature: 1.0
- Top-p: 0.95
- Top-k: 40
- Max generation length: 128,000
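For local reproduction, here is a minimal sketch of serving the FP8 checkpoint with SGLang and passing these sampling parameters through its OpenAI-compatible API. The model path `zai-org/GLM-4.6-FP8`, the port, and the tensor-parallel size are assumptions to adjust for your setup; `top_k` is passed via `extra_body`, since it is not a standard OpenAI parameter:

```python
# Sketch only: serve the FP8 checkpoint with SGLang, then query it with the
# sampling parameters listed above. Launch command (model path and --tp are
# assumptions for your hardware):
#
#   python -m sglang.launch_server \
#       --model-path zai-org/GLM-4.6-FP8 --tp 8 --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.6-FP8",  # assumed checkpoint name
    messages=[{"role": "user", "content": "Write a function that ..."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=128000,
    extra_body={"top_k": 40},  # top_k via SGLang's OpenAI API extension
)
print(response.choices[0].message.content)
```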
Please note that the online deployment at bigmodel.cn may experience intermittent interruptions under high traffic, which can lead to slightly lower benchmark results. We are actively addressing this issue. In the meantime, we recommend using the FP8 checkpoint with SGLang for local deployment and evaluation to ensure stable performance.
To further improve reproducibility for the research community, we will soon organize and release our evaluation scripts at
https://github.com/zai-org/glm-simple-evals,
so that others can more easily reproduce our experiments.