microsoft/bitnet-b1.58-2B-4T · Running inference with lm-evaluation-harness produces strange accuracy results

Hi, I set up transformers as instructed and tried to use lm-evaluation-harness to get some benchmark accuracy numbers, but the results look quite strange. Any clue what the reason might be?

Running command:
python3 -m lm_eval --model hf --model_args pretrained=microsoft/bitnet-b1.58-2B-4T,dtype=float16 --tasks hellaswag,winogrande,piqa,gsm8k,truthfulqa --device cuda --batch_size 64

Running log:
############################################################################################
2025-05-01:18:19:32,020 INFO [main.py:308] Verbosity set to INFO
2025-05-01:18:19:32,164 INFO [init.py:491] group and group_alias keys in tasks' configs will no longer be used in the next release of lm-eval. tag will be used to allow to call a collection of tasks just like group. group will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-05-01:18:19:36,676 INFO [main.py:414] Selected Tasks: ['gsm8k', 'hellaswag', 'piqa', 'truthfulqa', 'winogrande']
2025-05-01:18:19:36,678 INFO [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-05-01:18:19:36,678 INFO [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'microsoft/bitnet-b1.58-2B-4T', 'dtype': 'float16'}
2025-05-01:18:19:36,733 WARNING [other.py:349] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-05-01:18:19:36,734 INFO [huggingface.py:130] Using device 'cuda'
2025-05-01:18:19:37,561 INFO [huggingface.py:366] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.94k/7.94k [00:00<00:00, 13.6MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.31M/2.31M [00:00<00:00, 8.27MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 419k/419k [00:00<00:00, 2.09MB/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 7473/7473 [00:00<00:00, 188559.36 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 301783.06 examples/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.59k/9.59k [00:00<00:00, 18.6MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 223k/223k [00:00<00:00, 1.01MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 87243.40 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 9415.25 examples/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271k/271k [00:00<00:00, 1.08MB/s]
Generating validation split: 100%|████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 84042.44 examples/s]
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,319 WARNING [huggingface.py:469] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2025-05-01:18:20:12,323 INFO [task.py:423] Building contexts for winogrande on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1267/1267 [00:00<00:00, 3394.33it/s]
2025-05-01:18:20:12,730 INFO [task.py:423] Building contexts for truthfulqa_mc2 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 1019.90it/s]
2025-05-01:18:20:13,577 INFO [task.py:423] Building contexts for truthfulqa_mc1 on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 1043.73it/s]
2025-05-01:18:20:14,405 INFO [task.py:423] Building contexts for truthfulqa_gen on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 1768.20it/s]
2025-05-01:18:20:14,916 INFO [task.py:423] Building contexts for piqa on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1838/1838 [00:01<00:00, 1469.69it/s]
2025-05-01:18:20:16,220 INFO [task.py:423] Building contexts for hellaswag on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10042/10042 [00:03<00:00, 2837.28it/s]
2025-05-01:18:20:20,682 INFO [task.py:423] Building contexts for gsm8k on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 306.52it/s]
2025-05-01:18:20:25,014 INFO [evaluator.py:463] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 56374/56374 [08:22<00:00, 112.17it/s]
2025-05-01:18:29:05,459 INFO [evaluator.py:463] Running generate_until requests
Running generate_until requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2136/2136 [20:12<00:00, 1.76it/s]
2025-05-01:18:49:18,296 INFO [rouge_scorer.py:83] Using default tokenizer.
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2025-05-01:18:58:16,987 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
hf (pretrained=microsoft/bitnet-b1.58-2B-4T,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64

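| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2504 | ± | 0.0043 |
| | | none | 0 | acc_norm | ↑ | 0.2504 | ± | 0.0043 |
| piqa | 1 | none | 0 | acc | ↑ | 0.4951 | ± | 0.0117 |
| | | none | 0 | acc_norm | ↑ | 0.4951 | ± | 0.0117 |
| truthfulqa_gen | 3 | none | 0 | bleu_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | bleu_diff | ↑ | -0.0002 | ± | 0.0002 |
| | | none | 0 | bleu_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_max | ↑ | 0.0000 | ± | 0.0000 |
| truthfulqa_mc1 | 2 | none | 0 | acc | ↑ | 1.0000 | ± | 0.0000 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | NaN | ± | NaN |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4957 | ± | 0.0141 |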
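
To put a number on "strange": as a rough sanity check (not part of the original run), the multiple-choice scores can be compared against random guessing. The sketch below assumes four answer choices for hellaswag and two each for piqa and winogrande, which is my understanding of the standard task formats. `chance_report` is just a hypothetical helper name.

```python
# Accuracy and standard error copied from the results table above.
reported = {
    "hellaswag": (0.2504, 0.0043),
    "piqa": (0.4951, 0.0117),
    "winogrande": (0.4957, 0.0141),
}

# Assumed number of answer choices per task (not taken from the log).
num_choices = {"hellaswag": 4, "piqa": 2, "winogrande": 2}


def chance_report(reported, num_choices):
    """Return {task: (chance_accuracy, z_score)} for each reported task.

    The z-score is the distance from chance accuracy in units of the
    reported standard error.
    """
    rows = {}
    for task, (acc, stderr) in reported.items():
        chance = 1.0 / num_choices[task]
        rows[task] = (chance, (acc - chance) / stderr)
    return rows


if __name__ == "__main__":
    for task, (chance, z) in chance_report(reported, num_choices).items():
        print(f"{task}: chance={chance:.2f}, z={z:+.2f}")
```

Under those assumptions, all three scores sit within one standard error of chance. Together with the NaN for truthfulqa_mc2, that might point to a problem in how the model is loaded or scored rather than to its actual capability.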