
Missing model_weights/model.rm_head._extra_state

#1
by keminglu - opened

Thanks for sharing such a great reward model!

I am trying to reproduce the RewardBench results and serve this checkpoint using the official Docker image built from the Dockerfile, with these package versions:

nemo_aligner                  0.4.0.dev0
nemo_toolkit                  2.0.0rc0
megatron_core                 0.7.0
transformer-engine            1.7.0.dev0+a51ff54
transformers                  4.40.2

However, I ran into this missing-checkpoint error:

load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 227, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/home/data/lukeming.lkm/NeMo-Aligner/build_env/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 224, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /cpfs_01/296bu4pgyxye1ubw3fj/data/shared/Group-m6/lukeming.lkm/ckpts/public/Nemotron-4-340B-Reward/model_weights/model.rm_head._extra_state/shard_0_1.pt not found

It looks like there is no model.rm_head._extra_state part in the checkpoint.
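A quick way to confirm this is to look for the shard files directly on disk. The snippet below is only a minimal sketch: the directory layout (model_weights/<key>/shard_*.pt) is inferred from the path in the error message above, and the helper name find_extra_state_shards is hypothetical, not part of NeMo or Megatron-LM.

```python
from pathlib import Path


def find_extra_state_shards(ckpt_dir, key="model.rm_head._extra_state"):
    """List shard files under <ckpt_dir>/model_weights/<key>.

    Returns a sorted list of file names, or an empty list when the
    directory does not exist. The layout is an assumption taken from
    the path shown in the CheckpointingException message.
    """
    target = Path(ckpt_dir) / "model_weights" / key
    if not target.is_dir():
        return []
    return sorted(p.name for p in target.glob("shard_*.pt"))
```

Calling find_extra_state_shards with the path to the unpacked Nemotron-4-340B-Reward directory returns an empty list when the model.rm_head._extra_state entry is absent, which matches the error above.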

NVIDIA org

Hi, the issue was caused by the wrong container, which we pointed to initially. We have fixed the model card so that it now gives the path to the right container. Please try it and let us know if you have further questions.

keminglu changed discussion status to closed
