Running inference with lm-evaluation-harness produces strange accuracy results
Hi, I set up transformers as instructed and used lm-evaluation-harness to run some benchmarks, but the accuracy results look quite strange. Any idea what the cause might be?
Command I ran:
python3 -m lm_eval --model hf --model_args pretrained=microsoft/bitnet-b1.58-2B-4T,dtype=float16 --tasks hellaswag,winogrande,piqa,gsm8k,truthfulqa --device cuda --batch_size 64
Log output:
############################################################################################
2025-05-01:18:19:32,020 INFO [__main__.py:308] Verbosity set to INFO
2025-05-01:18:19:32,164 INFO [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-05-01:18:19:36,676 INFO [__main__.py:414] Selected Tasks: ['gsm8k', 'hellaswag', 'piqa', 'truthfulqa', 'winogrande']
2025-05-01:18:19:36,678 INFO [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-05-01:18:19:36,678 INFO [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'microsoft/bitnet-b1.58-2B-4T', 'dtype': 'float16'}
2025-05-01:18:19:36,733 WARNING [other.py:349] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-05-01:18:19:36,734 INFO [huggingface.py:130] Using device 'cuda'
2025-05-01:18:19:37,561 INFO [huggingface.py:366] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
Downloading readme: 100%|██████████| 7.94k/7.94k [00:00<00:00, 13.6MB/s]
Downloading data: 100%|██████████| 2.31M/2.31M [00:00<00:00, 8.27MB/s]
Downloading data: 100%|██████████| 419k/419k [00:00<00:00, 2.09MB/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 188559.36 examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 301783.06 examples/s]
Downloading readme: 100%|██████████| 9.59k/9.59k [00:00<00:00, 18.6MB/s]
Downloading data: 100%|██████████| 223k/223k [00:00<00:00, 1.01MB/s]
Generating validation split: 100%|██████████| 817/817 [00:00<00:00, 87243.40 examples/s]
Map: 100%|██████████| 817/817 [00:00<00:00, 9415.25 examples/s]
Downloading data: 100%|██████████| 271k/271k [00:00<00:00, 1.08MB/s]
Generating validation split: 100%|██████████| 817/817 [00:00<00:00, 84042.44 examples/s]
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,319 WARNING [huggingface.py:469] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2025-05-01:18:20:12,323 INFO [task.py:423] Building contexts for winogrande on rank 0...
100%|██████████| 1267/1267 [00:00<00:00, 3394.33it/s]
2025-05-01:18:20:12,730 INFO [task.py:423] Building contexts for truthfulqa_mc2 on rank 0...
100%|██████████| 817/817 [00:00<00:00, 1019.90it/s]
2025-05-01:18:20:13,577 INFO [task.py:423] Building contexts for truthfulqa_mc1 on rank 0...
100%|██████████| 817/817 [00:00<00:00, 1043.73it/s]
2025-05-01:18:20:14,405 INFO [task.py:423] Building contexts for truthfulqa_gen on rank 0...
100%|██████████| 817/817 [00:00<00:00, 1768.20it/s]
2025-05-01:18:20:14,916 INFO [task.py:423] Building contexts for piqa on rank 0...
100%|██████████| 1838/1838 [00:01<00:00, 1469.69it/s]
2025-05-01:18:20:16,220 INFO [task.py:423] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:03<00:00, 2837.28it/s]
2025-05-01:18:20:20,682 INFO [task.py:423] Building contexts for gsm8k on rank 0...
100%|██████████| 1319/1319 [00:04<00:00, 306.52it/s]
2025-05-01:18:20:25,014 INFO [evaluator.py:463] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 56374/56374 [08:22<00:00, 112.17it/s]
2025-05-01:18:29:05,459 INFO [evaluator.py:463] Running generate_until requests
Running generate_until requests: 100%|██████████| 2136/2136 [20:12<00:00, 1.76it/s]
2025-05-01:18:49:18,296 INFO [rouge_scorer.py:83] Using default tokenizer.
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2025-05-01:18:58:16,987 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
hf (pretrained=microsoft/bitnet-b1.58-2B-4T,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2504 | ± | 0.0043 |
| | | none | 0 | acc_norm | ↑ | 0.2504 | ± | 0.0043 |
| piqa | 1 | none | 0 | acc | ↑ | 0.4951 | ± | 0.0117 |
| | | none | 0 | acc_norm | ↑ | 0.4951 | ± | 0.0117 |
| truthfulqa_gen | 3 | none | 0 | bleu_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | bleu_diff | ↑ | -0.0002 | ± | 0.0002 |
| | | none | 0 | bleu_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_max | ↑ | 0.0000 | ± | 0.0000 |
| truthfulqa_mc1 | 2 | none | 0 | acc | ↑ | 1.0000 | ± | 0.0000 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | NaN | ± | NaN |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4957 | ± | 0.0141 |
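
To put a number on "strange", here is a minimal sketch that compares the multiple-choice accuracies from the table against random guessing. The chance baselines are my assumption (4 answer choices for hellaswag, 2 each for piqa and winogrande), and the helper names `z_score` and `at_chance` are made up for illustration; they are not part of lm-eval.

```python
# Reported (accuracy, stderr) pairs copied from the results table.
reported = {
    "hellaswag": (0.2504, 0.0043),
    "piqa": (0.4951, 0.0117),
    "winogrande": (0.4957, 0.0141),
}

# Assumed random-guess accuracy: 1 / number of answer choices per task.
chance = {"hellaswag": 0.25, "piqa": 0.5, "winogrande": 0.5}


def z_score(acc, stderr, baseline):
    """Distance between reported accuracy and the baseline, in standard errors."""
    return (acc - baseline) / stderr


def at_chance(acc, stderr, baseline, threshold=2.0):
    """True if the accuracy is within `threshold` standard errors of the baseline."""
    return abs(z_score(acc, stderr, baseline)) < threshold


for task, (acc, stderr) in reported.items():
    z = z_score(acc, stderr, chance[task])
    flag = at_chance(acc, stderr, chance[task])
    print(f"{task}: acc={acc:.4f} chance={chance[task]:.2f} z={z:+.2f} at_chance={flag}")
```

All three tasks come out within half a standard error of random guessing. Together with the NaN on truthfulqa_mc2, that would be consistent with the model returning uninformative or non-finite log-likelihoods in this setup, though the check alone does not prove it.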