W0217 04:25:43.887000 4078901 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0217 04:25:43.887000 4078901 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0217 04:25:43.887000 4078901 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0217 04:25:43.887000 4078901 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
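The launcher pins OMP_NUM_THREADS to 1 for every worker unless the variable is already set in the launching environment, so CPU-side work in each rank runs single-threaded by default. A minimal sketch of tuning this from inside the training script; the value 8 is illustrative, not taken from this log:

    import torch

    # torchrun exports OMP_NUM_THREADS=1 to each worker only when the variable is
    # not already set, so exporting it before launch is one fix. From inside the
    # script, torch.set_num_threads adjusts intra-op CPU parallelism directly.
    torch.set_num_threads(8)  # illustrative value; tune to cores per worker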
PyTorch: setting up devices
loading configuration file config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/config.json
You are using a model of type qwen2_5_vl to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
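The model-type warning appears because the checkpoint's config declares model_type "qwen2_5_vl" while it is being instantiated through LlavaQwenConfig/LlavaQwenForCausalLM, a custom wrapper (from the training code, not the library) whose model_type is "llava_qwen". A minimal sketch that makes the mismatch visible with stock transformers only:

    from transformers import AutoConfig

    # The checkpoint reports the upstream model type; loading it through a class
    # that registers a different model_type is what triggers the warning above.
    base_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
    print(base_cfg.model_type)  # "qwen2_5_vl", not "llava_qwen"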
Model config LlavaQwenConfig {
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "llava_qwen",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0.dev0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "hidden_size": 1280,
    "in_chans": 3,
    "model_type": "qwen2_5_vl",
    "spatial_patch_size": 14,
    "tokens_per_second": 2
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}

loading weights file model.safetensors from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/model.safetensors.index.json
Instantiating LlavaQwenForCausalLM model under default dtype torch.bfloat16.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}

Instantiating Qwen2_5_VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
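The Flash Attention 2.0 notice only means the weights are first materialized on CPU; the kernels need CUDA tensors at run time. A minimal sketch of the load-then-move pattern the message recommends, shown with the stock Qwen2_5_VLForConditionalGeneration class for brevity (the run above goes through the custom LlavaQwenForCausalLM wrapper instead):

    import torch
    from transformers import Qwen2_5_VLForConditionalGeneration

    # bf16 weights with FlashAttention-2, then move to the GPU as suggested.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    model.to("cuda")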
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.55it/s]
All model checkpoint weights were used when initializing LlavaQwenForCausalLM.
All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training.
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json
loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt
loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json
loading file chat_template.jinja from cache at None
loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json
Generate config GenerationConfig {
  "attn_implementation": "flash_attention_2",
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}
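The decoding parameters above come straight from the checkpoint's generation_config.json; with top_k set to 1 the sampling is effectively greedy despite do_sample being true. A minimal sketch of reproducing or overriding them with stock transformers, using the values shown in the log:

    from transformers import GenerationConfig

    # Values copied from the generation_config.json above; override any of them
    # at generate() time if different decoding behavior is needed.
    gen_cfg = GenerationConfig(
        bos_token_id=151643,
        eos_token_id=[151645, 151643],
        pad_token_id=151643,
        do_sample=True,
        temperature=0.1,
        top_k=1,
        top_p=0.001,
        repetition_penalty=1.05,
    )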
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
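The resize warning comes from resize_token_embeddings when the new vocabulary size (151668 here, after the extra special tokens) is not padded to a hardware-friendly multiple. A minimal sketch of the remedy the message points to, assuming `model` and `tokenizer` are the objects loaded earlier in this run; the multiple of 64 is an illustrative choice:

    # Pad the resized embedding matrix to a multiple of 64 so the dimension stays
    # Tensor-Core friendly instead of the raw 151668.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)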
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
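The slow-processor notice can be silenced by choosing explicitly, which is the mechanism the warning itself describes. A minimal sketch with stock transformers:

    from transformers import AutoImageProcessor

    # Opt in to the fast image processor now, or pass use_fast=False to keep the
    # slow one and today's exact outputs.
    image_processor = AutoImageProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        use_fast=True,
    )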
Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
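min_pixels and max_pixels bound the resized image area and therefore the number of visual tokens, since each token covers a 28×28 pixel area here (patch_size 14 × merge_size 2): 3136 = 4·28·28 and 12845056 = 16384·28·28. A minimal sketch of overriding the bounds when building the processor, following the pattern documented for Qwen2.5-VL; the 256 and 1280 token budgets are illustrative values, not from this log:

    from transformers import AutoProcessor

    # Illustrative budget: between 256 and 1280 visual tokens per image.
    min_pixels = 256 * 28 * 28
    max_pixels = 1280 * 28 * 28
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        min_pixels=min_pixels,
        max_pixels=max_pixels,
    )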
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)
{
  "processor_class": "Qwen2_5_VLProcessor"
}
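The processor above couples the image processor and the tokenizer. A minimal usage sketch with stock transformers; the image path and prompt are illustrative, and the chat-message layout follows the generic transformers multimodal format rather than anything specific to this training run:

    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

    # One user turn containing an image placeholder plus a text instruction.
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ]}
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    image = Image.open("example.jpg")  # illustrative path
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")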
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
    151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  }
)
{
  "processor_class": "Qwen2_5_VLProcessor"
}
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.19it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.97it/s]
All model checkpoint weights were used when initializing LlavaQwenForCausalLM.
All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training.
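The two warnings above are related: the tokenizer already carries 22 added tokens on top of its 151,643-entry base vocabulary, and the reported new embedding dimension of 151,668 suggests the training code adds a few more before resizing; the resulting size is not aligned to the multiples Tensor Cores want. A minimal sketch of doing the resize with Tensor Core friendly padding follows, assuming `model` and `tokenizer` are the objects loaded in this run; pad_to_multiple_of=64 is an illustrative choice, not a value from the log.

# Sketch only: `model` and `tokenizer` are assumed to be the objects loaded above.
# pad_to_multiple_of rounds the embedding matrix up to a Tensor Core friendly size;
# the extra rows are never emitted by the tokenizer and only cost a little memory.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
print(model.get_input_embeddings().weight.shape[0])  # rounded up to a multiple of 64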
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
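This warning concerns only the image-processing backend: the checkpoint was saved with the slow (PIL-based) Qwen2VLImageProcessor and `use_fast` was left unset, so transformers keeps the slow path. A minimal sketch of opting into the fast processor explicitly and overriding the pixel budget exposed in the preprocessor_config above; min_pixels and max_pixels are the dumped values, the rest is illustrative.

from transformers import AutoProcessor

# Sketch: request the fast image processor explicitly, as the warning suggests,
# and pass the pixel budget from the preprocessor_config shown above.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    use_fast=True,           # silences the "slow image processor" warning
    min_pixels=3136,         # 4 merged 28x28 px patches, as in the dump
    max_pixels=12845056,     # 16384 merged 28x28 px patches, as in the dump
)
print(type(processor.image_processor).__name__)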
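The image-processor settings above also fix the vision-token budget per image: pixels are cut into 14 px patches ("patch_size": 14), a 2 x 2 group of patches is merged into one token ("merge_size": 2), so each token covers a 28 x 28 px area, and "min_pixels"/"max_pixels" clamp the resized image to between 4 and 16,384 such tokens. A small self-contained sketch of that arithmetic follows; it mirrors only the clamping, not the exact resize rule, so treat it as an approximation.

# Rough vision-token budget implied by the Qwen2VLImageProcessor settings above.
# The real preprocessing also snaps height and width to multiples of
# patch_size * merge_size, so this is an approximation of the token count.
PATCH = 14              # "patch_size"
MERGE = 2               # "merge_size"
MIN_PIXELS = 3136       # "min_pixels"  -> 3136 / 28^2      = 4 tokens minimum
MAX_PIXELS = 12845056   # "max_pixels"  -> 12845056 / 28^2  = 16384 tokens maximum

def approx_vision_tokens(height: int, width: int) -> int:
    area = height * width
    # Clamp the pixel area into [MIN_PIXELS, MAX_PIXELS] as the processor does.
    area = min(max(area, MIN_PIXELS), MAX_PIXELS)
    token_area = (PATCH * MERGE) ** 2   # 28 x 28 px per merged token
    return max(1, round(area / token_area))

print(approx_vision_tokens(1080, 1920))   # ~2645 tokens for a Full HD frame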
loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json Generate config GenerationConfig { "attn_implementation": "flash_attention_2", "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.1, "top_k": 1, "top_p": 0.001 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.04it/s] Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.80it/s] All model checkpoint weights were used when initializing LlavaQwenForCausalLM. All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training. Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.83it/s] Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.54it/s] Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), All model checkpoint weights were used when initializing LlavaQwenForCausalLM. 
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training. 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json Generate config GenerationConfig { "attn_implementation": "flash_attention_2", "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.1, "top_k": 1, "top_p": 0.001 } loading file added_tokens.json from 
cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file special_tokens_map.json from cache at None loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file chat_template.jinja from cache at None loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [ 0.26862954, 0.26130258, 0.27577711 ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": { "longest_edge": 12845056, "shortest_edge": 3136 },
  "temporal_patch_size": 2
}
- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
  151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)
{
  "processor_class": "Qwen2_5_VLProcessor"
}
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.36it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.01it/s]
All model checkpoint weights were used when initializing LlavaQwenForCausalLM.
All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training.
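For reference, a sketch of how the processor dumped above is typically driven: the chat template inserts the vision special tokens (`<|vision_start|>`, `<|image_pad|>`, `<|vision_end|>`) around each image, and the image processor enforces the `min_pixels`/`max_pixels` bounds from the config. The image path is a placeholder, not a file from this run.

```python
# Sketch of typical Qwen2_5_VLProcessor usage; "example.jpg" is a placeholder.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
# Render the chat template (adds <|im_start|>/<|im_end|> and the vision tokens)...
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# ...then tokenize the text and patchify the image into model inputs.
inputs = processor(text=[prompt], images=[Image.open("example.jpg")], return_tensors="pt")
```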
loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json
Generate config GenerationConfig {
  "attn_implementation": "flash_attention_2",
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [ 151645, 151643 ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}
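The generation settings above are essentially greedy decoding dressed up as sampling (top_k=1 with a very low temperature and top_p). A sketch of reproducing them with transformers' `GenerationConfig`; `model` and `inputs` stand for the `LlavaQwenForCausalLM` and processor outputs built earlier in this run, and `max_new_tokens` is an illustrative value:

```python
# Sketch: the logged generation settings expressed as an explicit GenerationConfig.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    pad_token_id=151643,
    do_sample=True,          # sampling is enabled, but...
    top_k=1,                 # ...top_k=1 makes it effectively greedy
    top_p=0.001,
    temperature=0.1,
    repetition_penalty=1.05,
)
# output_ids = model.generate(**inputs, generation_config=gen_config, max_new_tokens=128)
```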
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available.
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
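The warning fires because 151668 is not a multiple of 8/16, which the linked guide recommends for Tensor Core efficiency. A sketch of the usual fix, with `model` and `tokenizer` standing in for the objects built in this run and the multiple of 64 being an illustrative choice:

```python
# Sketch: pass pad_to_multiple_of so the resized embedding gets a Tensor Core-friendly shape.
new_vocab = len(tokenizer)                           # 151668 in this run, per the warning
model.resize_token_embeddings(new_vocab, pad_to_multiple_of=64)
print(model.get_input_embeddings().weight.shape[0])  # 151680 = next multiple of 64
```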
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
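That notice is the standard reminder printed after new special tokens are registered: the corresponding embedding rows are freshly initialized and carry no signal until trained. A sketch of the pattern that produces it (the token names are placeholders, not the tokens added in this run):

```python
# Sketch: registering extra special tokens and growing the embedding table to match.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<custom_tok_1>", "<custom_tok_2>"]}
)
model.resize_token_embeddings(len(tokenizer))
# The new rows are randomly initialized; they only become meaningful once the
# embedding layer is fine-tuned, which is what the log message is warning about.
```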
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.96it/s] Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.89it/s] All model checkpoint weights were used when initializing LlavaQwenForCausalLM. All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training. loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json Generate config GenerationConfig { "attn_implementation": "flash_attention_2", "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.1, "top_k": 1, "top_p": 0.001 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at 
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
  151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)
{
  "processor_class": "Qwen2_5_VLProcessor"
}

loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
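The Processor dump and the use_fast warning above describe how the image processor and tokenizer are bundled. A small, assumed usage sketch follows (not taken from this run's code): it requests the fast image processor explicitly to avoid the warning, passes the min_pixels and max_pixels values printed in the config, and builds inputs through the chat template, which inserts the <|vision_start|>/<|image_pad|>/<|vision_end|> tokens listed in the tokenizer dump. The image path and prompt are placeholders.

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    use_fast=True,         # opt in to the fast image processor, silencing the warning above
    min_pixels=3136,       # values from the printed image processor config
    max_pixels=12845056,
)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# apply_chat_template renders the conversation with the special tokens shown above;
# the processor then turns the text plus pixel values into model-ready tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt")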
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available.
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc

Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.17it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.94it/s]
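The two warnings kept above report that extra tokens were added on top of the base vocabulary and that the embedding matrix was resized to 151668 rows without a pad_to_multiple_of value. A hedged sketch of the usual pattern follows; the added token strings are placeholders, not the tokens actually added by this training code, and padding to a multiple of 64 is one common choice for keeping the matrix Tensor Core friendly.

from transformers import AutoTokenizer, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)

# Adding special tokens triggers the "Special tokens have been added in the vocabulary" notice;
# the strings here are placeholders, not the tokens used in this run.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<placeholder_token_1>", "<placeholder_token_2>"]}
)

# Supplying pad_to_multiple_of rounds the new vocabulary size up so the embedding matrix keeps
# a Tensor Core friendly shape, which is what the resize warning above recommends.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)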
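Every "from cache at /fsx_0/user/zhaojiang/models/hub/..." line in this log resolves files from the Hugging Face Hub cache on the shared filesystem. How that location was configured is not shown in the log; the sketch below lists the two usual options, with the paths taken from the log and everything else assumed.

import os

# Option 1 (assumed): set HF_HOME before importing transformers, so the Hub cache
# lives under /fsx_0/user/zhaojiang/models/hub as seen in this log.
os.environ["HF_HOME"] = "/fsx_0/user/zhaojiang/models"

from transformers import AutoProcessor

# Option 2 (assumed): pass cache_dir explicitly to individual from_pretrained calls.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    cache_dir="/fsx_0/user/zhaojiang/models/hub",
)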
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.95it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 3.95it/s]

All model checkpoint weights were used when initializing LlavaQwenForCausalLM.
All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training.
single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
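The slow-image-processor warning above is about the `use_fast` flag. Below is a minimal sketch, not part of the logged run, of how the same Qwen2.5-VL processor could be loaded with the fast variant requested explicitly; the pixel bounds are copied from the config dump above, and forwarding `use_fast`, `min_pixels`, and `max_pixels` through `AutoProcessor` is assumed to behave as the warning and the Qwen2.5-VL documentation describe.

```python
# Illustrative sketch only: silence the slow-processor warning by asking for the
# fast image processor when loading the Qwen2.5-VL processor.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    use_fast=True,        # opt in ahead of the v4.48 default mentioned in the warning
    min_pixels=3136,      # matches "min_pixels" / "shortest_edge" in the dumped config
    max_pixels=12845056,  # matches "max_pixels" / "longest_edge" in the dumped config
)
```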
loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json
Generate config GenerationConfig {
  "attn_implementation": "flash_attention_2",
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [ 151645, 151643 ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
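For reference, the generation settings dumped above could equally be constructed in code; every value below is copied from the log, and `attn_implementation` is treated as a model-loading argument rather than a sampling parameter (see the loading sketch after the checkpoint-shard messages further down).

```python
# Illustrative only: the same generation settings as generation_config.json, built in code.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    pad_token_id=151643,
    do_sample=True,
    temperature=0.1,
    top_k=1,
    top_p=0.001,
    repetition_penalty=1.05,
)
```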
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
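Both the `pad_to_multiple_of` warning and the special-tokens warning above refer to the same step: once extra tokens are registered on the tokenizer, the model's embedding matrix must be resized, and padding the new vocabulary size keeps it Tensor-Core friendly. A hedged sketch of that step follows; the helper name, the example token list, and the multiple of 64 are illustrative assumptions, not values taken from this training script.

```python
# Hypothetical helper: add tokens, then resize embeddings to a padded vocabulary size.
from transformers import PreTrainedModel, PreTrainedTokenizerBase

def add_tokens_and_resize(model: PreTrainedModel,
                          tokenizer: PreTrainedTokenizerBase,
                          new_tokens: list[str]) -> None:
    """Register extra special tokens and resize the model's token embeddings."""
    num_added = tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
    if num_added > 0:
        # pad_to_multiple_of rounds the vocabulary up (e.g. past the 151668 reported
        # in the warning) so the embedding width stays a multiple suitable for Tensor Cores.
        model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
```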
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.27it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.01it/s]
All model checkpoint weights were used when initializing LlavaQwenForCausalLM.
All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training.
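A rough sketch of the load that produces the messages above. `LlavaQwenForCausalLM` is this codebase's custom model class; the import path and dtype below are assumptions, and only the checkpoint name and the `flash_attention_2` setting come from the log.

```python
# Sketch under assumptions: load the custom LlavaQwenForCausalLM from the Qwen2.5-VL checkpoint.
import torch
from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM  # assumed import path

model = LlavaQwenForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",            # checkpoint named in the log
    torch_dtype=torch.bfloat16,               # illustrative; the run's dtype is not shown
    attn_implementation="flash_attention_2",  # matches the generation config dump
)
```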
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None 151646: 
AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), loading file chat_template.jinja from cache at None 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at 
/fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at 
/fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json Generate config GenerationConfig { "attn_implementation": "flash_attention_2", "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.1, "top_k": 1, "top_p": 0.001 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. 
This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at 
/fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at 
/fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. 
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}
- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
    151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)
{
  "processor_class": "Qwen2_5_VLProcessor"
}
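Note: the processor dump above shows the default visual-token budget ("min_pixels": 3136, "max_pixels": 12845056, 14-pixel patches merged 2x2). If that budget needs to be capped, the bounds can be overridden when the processor is loaded. A minimal sketch with illustrative values; the 256*28*28 and 1280*28*28 numbers are assumptions, not settings from this run:

```python
from transformers import AutoProcessor

# Illustrative override of the pixel bounds reported in the processor dump;
# a smaller max_pixels caps how many image tokens a single image can produce.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)
```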
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
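Note: per the warning, the embedding matrix is resized to exactly 151668 rows, a size that is not Tensor-Core friendly. A minimal sketch of the padded resize the warning suggests; the model class, the placeholder token, and the multiple of 64 are assumptions, since the log does not show the actual resize call:

```python
from transformers import AutoTokenizer, Qwen2_5_VLForConditionalGeneration

# Assumed reconstruction of the resize step: add new special tokens, then pad
# the embedding size up to a multiple of 64 so Tensor Cores stay usable.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
tokenizer.add_special_tokens({"additional_special_tokens": ["<placeholder_token>"]})  # hypothetical tokens
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
```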
loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json
Generate config GenerationConfig {
  "attn_implementation": "flash_attention_2",
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}
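Note: the cached generation config above effectively pins near-greedy decoding (sampling with temperature 0.1, top_k 1, top_p 0.001) plus a 1.05 repetition penalty. A minimal sketch of spelling the same settings out explicitly instead of relying on the hub snapshot; constructing the object by hand is the only assumption, and attn_implementation is a model-level flag so it is omitted here:

```python
from transformers import GenerationConfig

# Mirror the cached generation_config.json shown above so decoding behaviour
# is explicit in the training/eval code rather than inherited from the snapshot.
generation_config = GenerationConfig(
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    pad_token_id=151643,
    do_sample=True,
    temperature=0.1,
    top_k=1,
    top_p=0.001,
    repetition_penalty=1.05,
)
```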
Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: 
AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
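For reference, the `min_pixels` / `max_pixels` values in the Qwen2VLImageProcessor config printed above translate directly into a visual-token budget: each visual token covers a (patch_size * merge_size)^2 = 28x28 pixel block. A small sketch of that arithmetic; the function name is made up for illustration, and the real processor additionally snaps image sides to multiples of 28 during resizing, so this is only an approximation:

    # Values taken from the Qwen2VLImageProcessor config printed above.
    PATCH_SIZE = 14
    MERGE_SIZE = 2
    MIN_PIXELS = 3136        # 4 * 28 * 28  -> at least 4 visual tokens
    MAX_PIXELS = 12845056    # 16384 * 28 * 28 -> at most 16384 visual tokens

    def visual_token_budget(pixels: int) -> int:
        """Approximate visual-token count for an image of `pixels` area
        after it is resized into the [MIN_PIXELS, MAX_PIXELS] range."""
        pixels = max(MIN_PIXELS, min(MAX_PIXELS, pixels))
        block = (PATCH_SIZE * MERGE_SIZE) ** 2   # 784 pixels per merged token
        return pixels // block

    # e.g. a 1280x720 image stays inside the range and yields roughly
    # 1280 * 720 // 784 = 1175 visual tokens.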
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
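The deprecation notice above can be silenced when the processor is loaded. A minimal sketch, assuming that `use_fast` and the pixel-budget kwargs are forwarded to the image processor when AutoProcessor builds it; the values shown simply repeat the defaults from the config printed earlier:

    from transformers import AutoProcessor

    # Opt in to the fast image processor explicitly and pin the pixel budget.
    # (Forwarding of use_fast through AutoProcessor is assumed here.)
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        use_fast=True,
        min_pixels=3136,
        max_pixels=12845056,
    )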
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 
151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
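The Qwen2VLImageProcessor settings shown above (patch_size 14, merge_size 2, min_pixels 3136, max_pixels 12845056) set the visual token budget: each merged token covers a 28x28 pixel area, so the defaults allow roughly 4 to 16384 image tokens per image. If a run needs a tighter budget, the limits can be overridden when the processor is built; the values below are illustrative, not taken from this log:

from transformers import AutoProcessor

# Roughly 256 to 1280 visual tokens per image (one token per 28x28 pixel block).
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,
)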
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json
loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt
loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json
loading file chat_template.jinja from cache at None
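The cache paths above indicate the Hugging Face hub cache has been redirected to a shared filesystem under /fsx_0/user/zhaojiang/models; the entries that resolve to None (added_tokens.json, special_tokens_map.json, chat_template.jinja) are simply files the Qwen/Qwen2.5-VL-7B-Instruct repo does not ship. A sketch of producing the same cache layout; whether this run sets the environment variable or passes an explicit cache_dir is not shown in the log:

import os

# Redirect the whole HF cache before transformers is imported ...
os.environ["HF_HOME"] = "/fsx_0/user/zhaojiang/models"

from transformers import AutoTokenizer

# ... or pass cache_dir per call; both end up under .../models/hub.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    cache_dir="/fsx_0/user/zhaojiang/models/hub",
)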
single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. 
This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", 
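The resize warning indicates the embedding matrix is grown to 151668 rows, apparently three entries beyond the 151665 tokens the tokenizer dump lists (vocab_size 151643 plus the added tokens through id 151664), and 151668 is not a Tensor-Core-friendly size. The following is a hedged sketch of how such a warning is typically silenced, assuming the stock `resize_token_embeddings` API and a transformers release that ships `Qwen2_5_VLForConditionalGeneration`; the added token strings are placeholders, since the log does not show which tokens this run actually adds.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder token strings; the three tokens added in this run are not logged.
new_tokens = ["<extra_token_0>", "<extra_token_1>", "<extra_token_2>"]
processor.tokenizer.add_tokens(new_tokens)  # tokenizer grows 151665 -> 151668 entries

# Without pad_to_multiple_of the embedding is resized to exactly 151668 rows,
# which triggers the warning above on every rank. Rounding up to a multiple of
# 64 (151680 rows) keeps the matmul dimensions Tensor-Core friendly.
model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)

With `pad_to_multiple_of` supplied, the per-rank warning lines below would not be emitted.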
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available.
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: 
AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', 
'<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: 
Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, 
"processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 
0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 
151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
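The Qwen2_5_VLProcessor dumps in this log show the image preprocessing bounds in effect for the run ("min_pixels": 3136, "max_pixels": 12845056, patch size 14, spatial merge size 2). A hedged sketch of loading the same processor with those bounds made explicit, using the standard AutoProcessor API rather than the exact loading code of this script (lowering max_pixels is the usual way to cap the number of visual tokens per image):

    from transformers import AutoProcessor

    # min_pixels / max_pixels mirror the values in the Qwen2VLImageProcessor dump;
    # they bound the resized image area and therefore the visual sequence length.
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        min_pixels=3136,        # 56 * 56
        max_pixels=12845056,    # 3584 * 3584
    )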
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
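Because the run adds tokens beyond the base vocabulary (the resize to 151668 above), their embedding rows are newly initialized and only become useful if they receive gradients. A rough sketch of one way to guarantee that when most of the model is frozen; the parameter-name matching is an assumption about the usual Qwen2.5-VL module layout, not something taken from this log:

    # Keep the input embeddings and output head trainable so the newly added
    # special-token rows are actually learned during fine-tuning.
    for name, param in model.named_parameters():
        if "embed_tokens" in name or "lm_head" in name:
            param.requires_grad = True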
AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: 
AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, 
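This warning is emitted by resize_token_embeddings: after the extra special tokens are added, the vocabulary has 151668 rows, which is not a multiple of 8, so cuBLAS cannot pick Tensor Core friendly GEMM shapes for the embedding and LM head. A hedged sketch of the fix the message suggests, assuming a transformers release that ships the Qwen2.5-VL classes (the actual training script and its added tokens are not shown in this log):

    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    # Sketch under assumptions: the 151668 figure is taken from the warning text,
    # and the real run adds its own extra tokens before resizing.
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

    # ... the real script adds its extra special tokens to processor.tokenizer here ...

    # pad_to_multiple_of rounds the embedding matrix up to the next multiple of 64
    # (151668 -> 151680), which restores Tensor Core friendly shapes and silences
    # the warning above.
    model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)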
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. 
Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=map_location)
[the same /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622 FutureWarning is emitted by every worker process]
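For reference, this is the kind of change the warning above is asking for. A minimal sketch, assuming a hypothetical checkpoint path and a hypothetical allow-listed class; it is not the project's actual loading code:

```python
import torch
from torch.serialization import add_safe_globals

# Hypothetical class that a checkpoint might reference; anything that is not a
# tensor or a primitive must be explicitly allow-listed before it can be
# unpickled when weights_only=True.
class VisionTowerConfig:
    pass

add_safe_globals([VisionTowerConfig])

# weights_only=True restricts unpickling to tensors, primitives and allow-listed
# globals, which is what the FutureWarning recommends for untrusted files.
checkpoint = torch.load("eva_vit_checkpoint.pt", map_location="cpu", weights_only=True)
```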
Using custom data configuration default-5e4e9de28fd39dca
Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset
[the two lines above are repeated once per data worker]
Generating dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f)
Downloading and preparing dataset webdataset/default to /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f...
Spawning 128 processes for 128 objects in slices of [1, 1, 1, …, 1] (128 entries, all 1)
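For context, the `webdataset` builder messages and the 128 spawned processes above are what the `datasets` library prints when a WebDataset is materialized with multiprocessing. A rough sketch of that kind of call; the shard list and cache directory are placeholders, not the paths actually used in this run:

```python
from datasets import load_dataset

# Placeholder shard list; the real run points at the project's own tar shards.
shards = [f"/path/to/shards/{i:05d}.tar" for i in range(128)]

ds = load_dataset(
    "webdataset",                  # the packaged module named in the log
    data_files={"train": shards},
    cache_dir="/path/to/cache",    # e.g. the /fsx_0/.../webdataset directory above
    num_proc=128,                  # likely what produces "Spawning 128 processes"
    split="train",
)
```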
Downloading data:   0%|          | 0/1382 [00:00<…]
Traceback (most recent call last):
  File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
    trainer.train()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
    self._maybe_log_save_evaluate(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
    self._save_optimizer_and_scheduler(output_dir)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
    self.model_wrapped.save_checkpoint(output_dir)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3199, in save_checkpoint
    self._save_checkpoint(save_dir,
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3421, in _save_checkpoint
    self.checkpoint_engine.save(state, save_path)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
    torch.save(state_dict, path)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/serialization.py", line 849, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/serialization.py", line 690, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 27282028288 vs 27282028184
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/serialization.py", line 850, in save
[rank0]:     _save(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/serialization.py", line 1114, in _save
[rank0]:     zip_file.write_record(name, storage, num_bytes)
[rank0]: RuntimeError: [enforce fail at inline_container.cc:778] . PytorchStreamWriter failed writing file data/723: file write failed
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank0]:     train(attn_implementation="flash_attention_2")
[rank0]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank0]:     trainer.train()
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank0]:     self._save_checkpoint(model, trial)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank0]:     self._save_optimizer_and_scheduler(output_dir)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank0]:     self.model_wrapped.save_checkpoint(output_dir)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3199, in save_checkpoint
[rank0]:     self._save_checkpoint(save_dir,
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3421, in _save_checkpoint
[rank0]:     self.checkpoint_engine.save(state, save_path)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
[rank0]:     torch.save(state_dict, path)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/serialization.py", line 849, in save
[rank0]:     with _open_zipfile_writer(f) as opened_zipfile:
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/serialization.py", line 690, in __exit__
[rank0]:     self.file_like.write_end_of_file()
[rank0]: RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 27282028288 vs 27282028184
[rank0]:[W217 14:56:40.817564534 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank85]:[E217 15:05:47.918440673 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 85] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank117]:[E217 15:05:47.719825730 ProcessGroupNCCL.cpp:616] [Rank 117] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[equivalent "Watchdog caught collective operation timeout" and "Exception (either an error or timeout) detected by watchdog" messages for SeqNum=66413 were emitted by the other ranks within the same second, 15:05:47]
[rank56]:[E217 15:05:47.505120649 ProcessGroupNCCL.cpp:616] [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [rank84]:[E217 15:05:47.924177066 ProcessGroupNCCL.cpp:616] [Rank 84] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [rank70]:[E217 15:05:47.695873627 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 70] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank127]:[E217 15:05:47.549356772 ProcessGroupNCCL.cpp:616] [Rank 127] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. [rank57]:[E217 15:05:47.450052895 ProcessGroupNCCL.cpp:616] [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out. [rank39]:[E217 15:05:47.931856603 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 39] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank71]:[E217 15:05:47.703303601 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 71] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank75]:[E217 15:05:47.874644470 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 75] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank24]:[E217 15:05:47.596027685 ProcessGroupNCCL.cpp:616] [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. [rank12]:[E217 15:05:47.798301137 ProcessGroupNCCL.cpp:616] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. [rank73]:[E217 15:05:47.779021273 ProcessGroupNCCL.cpp:616] [Rank 73] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [rank102]:[E217 15:05:47.598885464 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 102] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank72]:[E217 15:05:47.836379880 ProcessGroupNCCL.cpp:616] [Rank 72] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. [rank114]:[E217 15:05:47.716459707 ProcessGroupNCCL.cpp:616] [Rank 114] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. 
[rank118]:[E217 15:05:47.847739005 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 118] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank30]:[E217 15:05:47.647180872 ProcessGroupNCCL.cpp:616] [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. [rank3]:[E217 15:05:47.874550342 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. [rank67]:[E217 15:05:47.595132779 ProcessGroupNCCL.cpp:616] [Rank 67] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. [rank50]:[E217 15:05:47.659865215 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 50] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank26]:[E217 15:05:47.649369097 ProcessGroupNCCL.cpp:616] [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. [rank36]:[E217 15:05:47.955081404 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 36] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank62]:[E217 15:05:47.597303827 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 62] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank37]:[E217 15:05:47.957429598 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 37] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank82]:[E217 15:05:47.059367330 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 82] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank110]:[E217 15:05:47.960520252 ProcessGroupNCCL.cpp:616] [Rank 110] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. [rank91]:[E217 15:05:47.022828250 ProcessGroupNCCL.cpp:616] [Rank 91] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. [rank22]:[E217 15:05:47.155346145 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 22] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank65]:[E217 15:05:47.676477921 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 65] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank52]:[E217 15:05:47.593769451 ProcessGroupNCCL.cpp:616] [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. [rank87]:[E217 15:05:47.060650022 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 87] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank104]:[E217 15:05:47.016455605 ProcessGroupNCCL.cpp:616] [Rank 104] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. [rank44]:[E217 15:05:47.036600186 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 44] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank54]:[E217 15:05:47.671509845 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 54] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank20]:[E217 15:05:47.058946180 ProcessGroupNCCL.cpp:616] [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. [rank6]:[E217 15:05:47.804753912 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. [rank40]:[E217 15:05:47.038675512 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 40] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank25]:[E217 15:05:47.734972099 ProcessGroupNCCL.cpp:616] [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. [rank45]:[E217 15:05:47.976088228 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 45] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank49]:[E217 15:05:47.675350627 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 49] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank94]:[E217 15:05:47.112315406 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 94] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank5]:[E217 15:05:47.806264441 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [rank106]:[E217 15:05:47.137938626 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 106] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank121]:[E217 15:05:47.641550071 ProcessGroupNCCL.cpp:616] [Rank 121] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. [rank101]:[E217 15:05:47.594636363 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 101] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank46]:[E217 15:05:47.982396521 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 46] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank53]:[E217 15:05:47.687703771 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 53] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank105]:[E217 15:05:47.978467050 ProcessGroupNCCL.cpp:616] [Rank 105] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. [rank112]:[E217 15:05:47.799546651 ProcessGroupNCCL.cpp:616] [Rank 112] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. [rank28]:[E217 15:05:47.759080651 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 28] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank2]:[E217 15:05:47.810395603 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. [rank31]:[E217 15:05:47.660546944 ProcessGroupNCCL.cpp:616] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. [rank103]:[E217 15:05:47.592921470 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 103] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank55]:[E217 15:05:47.692620466 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 55] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank123]:[E217 15:05:47.667906286 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 123] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank29]:[E217 15:05:47.761809477 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 29] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank27]:[E217 15:05:47.752648219 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 27] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank41]:[E217 15:05:47.896291981 ProcessGroupNCCL.cpp:616] [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600082 milliseconds before timing out. [rank52]:[E217 15:05:47.705787368 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 52] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank13]:[E217 15:05:47.925842360 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 13] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank84]:[E217 15:05:47.104648894 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 84] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank127]:[E217 15:05:47.720087917 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 127] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank23]:[E217 15:05:47.198369088 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 23] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank76]:[E217 15:05:47.937313860 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 76] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank64]:[E217 15:05:47.683020671 ProcessGroupNCCL.cpp:616] [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. [rank96]:[E217 15:05:47.545973701 ProcessGroupNCCL.cpp:616] [Rank 96] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600090 milliseconds before timing out. [rank95]:[E217 15:05:47.149474917 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 95] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank58]:[E217 15:05:47.601961569 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 58] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank125]:[E217 15:05:47.641518529 ProcessGroupNCCL.cpp:616] [Rank 125] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [rank34]:[E217 15:05:47.958973059 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 34] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank80]:[E217 15:05:47.942988777 ProcessGroupNCCL.cpp:616] [Rank 80] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. [rank47]:[E217 15:05:47.048459490 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 47] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank19]:[E217 15:05:47.231809133 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 19] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank77]:[E217 15:05:47.933044288 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 77] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank120]:[E217 15:05:47.578497154 ProcessGroupNCCL.cpp:616] [Rank 120] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. [rank11]:[E217 15:05:47.962180715 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 11] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank14]:[E217 15:05:47.966751710 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 14] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank21]:[E217 15:05:47.241568610 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 21] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank17]:[E217 15:05:47.207104333 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 17] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank32]:[E217 15:05:47.051922977 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 32] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank73]:[E217 15:05:47.940656768 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 73] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank1]:[E217 15:05:47.966756540 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank9]:[E217 15:05:47.880302862 ProcessGroupNCCL.cpp:616] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. [rank42]:[E217 15:05:47.032881778 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 42] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank92]:[E217 15:05:47.192306102 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 92] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank91]:[E217 15:05:47.181221842 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 91] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank72]:[E217 15:05:47.941713840 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 72] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank74]:[E217 15:05:47.984153466 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 74] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank7]:[E217 15:05:47.966751389 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank110]:[E217 15:05:47.137939815 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 110] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank20]:[E217 15:05:47.273603743 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 20] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank107]:[E217 15:05:47.137945576 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 107] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank24]:[E217 15:05:47.795671059 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 24] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank125]:[E217 15:05:47.784579320 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 125] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank89]:[E217 15:05:47.187381257 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 89] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank108]:[E217 15:05:47.147142344 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 108] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank12]:[E217 15:05:47.024761854 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 12] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank30]:[E217 15:05:47.872635644 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 30] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank67]:[E217 15:05:47.822461981 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 67] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank99]:[E217 15:05:47.627673385 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 99] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank4]:[E217 15:05:47.975661807 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank111]:[E217 15:05:47.167662281 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 111] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank109]:[E217 15:05:47.167475387 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 109] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank98]:[E217 15:05:47.631478818 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 98] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank93]:[E217 15:05:47.204099738 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 93] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank31]:[E217 15:05:47.884725186 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 31] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank8]:[E217 15:05:47.977186360 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 8] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank59]:[E217 15:05:47.675988266 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 59] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank97]:[E217 15:05:47.643183372 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 97] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank114]:[E217 15:05:47.963080163 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 114] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank41]:[E217 15:05:47.125032277 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 41] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank33]:[E217 15:05:47.040465034 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 33] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank57]:[E217 15:05:47.690272686 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 57] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank122]:[E217 15:05:47.751202014 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 122] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank26]:[E217 15:05:47.879487972 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 26] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank16]:[E217 15:05:47.139761730 ProcessGroupNCCL.cpp:616] [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [rank3]:[E217 15:05:47.030407065 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank25]:[E217 15:05:47.895399593 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 25] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank6]:[E217 15:05:47.044157261 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank5]:[E217 15:05:47.049090786 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank121]:[E217 15:05:47.799622126 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 121] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank2]:[E217 15:05:47.058446403 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank105]:[E217 15:05:47.220218828 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 105] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank9]:[E217 15:05:47.105912081 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 9] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank56]:[E217 15:05:47.706476786 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 56] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank104]:[E217 15:05:47.239854618 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 104] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank120]:[E217 15:05:47.859221740 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 120] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank112]:[E217 15:05:47.013610438 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 112] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank80]:[E217 15:05:47.244566420 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 80] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank88]:[E217 15:05:47.219879040 ProcessGroupNCCL.cpp:616] [Rank 88] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. [rank96]:[E217 15:05:47.764996433 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 96] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank64]:[E217 15:05:47.909325308 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 64] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank16]:[E217 15:05:47.460077230 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 16] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank88]:[E217 15:05:47.428671962 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 88] Exception (either an error or timeout) detected by watchdog at work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank48]:[E217 15:05:47.230204390 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 48] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank48]:[E217 15:05:47.230242041 ProcessGroupNCCL.cpp:630] [Rank 48] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank48]:[E217 15:05:47.230250341 ProcessGroupNCCL.cpp:636] [Rank 48] To avoid data inconsistency, we are taking the entire process down.
[rank24]:[E217 15:05:47.445165617 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 24] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank24]:[E217 15:05:47.445190688 ProcessGroupNCCL.cpp:630] [Rank 24] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank24]:[E217 15:05:47.445195318 ProcessGroupNCCL.cpp:636] [Rank 24] To avoid data inconsistency, we are taking the entire process down.
[rank48]: Traceback (most recent call last):
[rank48]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank48]:     train(attn_implementation="flash_attention_2")
[rank48]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank48]:     trainer.train()
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank48]:     return inner_training_loop(
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank48]:     self._maybe_log_save_evaluate(
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank48]:     self._save_checkpoint(model, trial)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank48]:     self._save_optimizer_and_scheduler(output_dir)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank48]:     self.model_wrapped.save_checkpoint(output_dir)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank48]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank48]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank48]:     return func(*args, **kwargs)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank48]:     return cdb.barrier(group=group, async_op=async_op)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank48]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank48]:     return func(*args, **kwargs)
[rank48]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank48]:     work = group.barrier(opts=opts)
[rank48]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 48.
[rank24]: Traceback (most recent call last):
[rank24]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank24]:     train(attn_implementation="flash_attention_2")
[rank24]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank24]:     trainer.train()
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank24]:     return inner_training_loop(
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank24]:     self._maybe_log_save_evaluate(
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank24]:     self._save_checkpoint(model, trial)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank24]:     self._save_optimizer_and_scheduler(output_dir)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank24]:     self.model_wrapped.save_checkpoint(output_dir)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank24]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank24]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank24]:     return func(*args, **kwargs)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank24]:     return cdb.barrier(group=group, async_op=async_op)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank24]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank24]:     return func(*args, **kwargs)
[rank24]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank24]:     work = group.barrier(opts=opts)
[rank24]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 24.
[rank40]:[E217 15:05:48.838832847 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 40] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank40]:[E217 15:05:48.838859529 ProcessGroupNCCL.cpp:630] [Rank 40] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank40]:[E217 15:05:48.838865579 ProcessGroupNCCL.cpp:636] [Rank 40] To avoid data inconsistency, we are taking the entire process down. [rank24]:[E217 15:05:48.619520994 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 24] Process group watchdog thread terminated with exception: [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7141a42e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71415962a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x714159631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71415963361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7141a4a735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7141a8c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7141a8d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 24] Process group watchdog thread terminated with exception: [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7141a42e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71415962a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x714159631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71415963361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7141a4a735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7141a8c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7141a8d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7141a42e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7141592a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7141a4a735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7141a8c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7141a8d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank32]:[E217 15:05:48.806799725 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 32] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank32]:[E217 15:05:48.806831227 ProcessGroupNCCL.cpp:630] [Rank 32] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank32]:[E217 15:05:48.806837927 ProcessGroupNCCL.cpp:636] [Rank 32] To avoid data inconsistency, we are taking the entire process down. [rank48]:[E217 15:05:48.526981651 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 48] Process group watchdog thread terminated with exception: [Rank 48] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75a35036c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75a305a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75a305a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75a305a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75a3514635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75a355094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75a355126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 48] Process group watchdog thread terminated with exception: [Rank 48] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75a35036c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75a305a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75a305a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75a305a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75a3514635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75a355094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75a355126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75a35036c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75a3056a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x75a3514635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75a355094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75a355126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank8]:[E217 15:05:48.804262059 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 8] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank8]:[E217 15:05:48.804282770 ProcessGroupNCCL.cpp:630] [Rank 8] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank8]:[E217 15:05:48.804288930 ProcessGroupNCCL.cpp:636] [Rank 8] To avoid data inconsistency, we are taking the entire process down.
[rank72]:[E217 15:05:48.854738694 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 72] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank72]:[E217 15:05:48.854765866 ProcessGroupNCCL.cpp:630] [Rank 72] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank72]:[E217 15:05:48.854770856 ProcessGroupNCCL.cpp:636] [Rank 72] To avoid data inconsistency, we are taking the entire process down.
[rank32]: Traceback (most recent call last):
[rank32]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank32]:     train(attn_implementation="flash_attention_2")
[rank32]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank32]:     trainer.train()
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank32]:     return inner_training_loop(
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank32]:     self._maybe_log_save_evaluate(
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank32]:     self._save_checkpoint(model, trial)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank32]:     self._save_optimizer_and_scheduler(output_dir)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank32]:     self.model_wrapped.save_checkpoint(output_dir)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank32]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank32]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank32]:     return func(*args, **kwargs)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank32]:     return cdb.barrier(group=group, async_op=async_op)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank32]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank32]:     return func(*args, **kwargs)
[rank32]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank32]:     work = group.barrier(opts=opts)
[rank32]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 32.
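Note: every timeout line in this dump reports OpType=ALLREDUCE with NumelIn=1, NumelOut=1, which is what dist.barrier() looks like on the NCCL backend, since NCCL implements barrier as an all-reduce over a single element. A minimal two-process sketch of that failure mode follows; the short timeout, the sleeping rank, and the script name are illustrative assumptions and are not taken from this job.

```python
# barrier_timeout_repro.py -- illustrative only.
# Run with: torchrun --nproc_per_node=2 barrier_timeout_repro.py
# One rank stalls (standing in for a rank stuck while writing a checkpoint),
# so the other rank's NCCL barrier (a 1-element all-reduce under the hood)
# times out and the watchdog takes the process down, as in the log above.
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist

def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # Deliberately short timeout so the repro fails fast; the failing job used
    # the 600000 ms value shown in the log.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=30))

    if rank == 1:
        # This rank never reaches the collective within the timeout window.
        time.sleep(120)

    dist.barrier()  # rank 0 times out here; the watchdog aborts the communicator
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```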
[rank8]:[E217 15:05:48.961748511 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7208ae56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x720863c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x720863c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x720863c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7208af25c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7208b3294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7208b3326850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7208ae56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x720863c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x720863c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x720863c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7208af25c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7208b3294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7208b3326850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7208ae56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7208638a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7208af25c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7208b3294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7208b3326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank40]: Traceback (most recent call last):
[rank40]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank40]:     train(attn_implementation="flash_attention_2")
[rank40]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank40]:     trainer.train()
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank40]:     return inner_training_loop(
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank40]:     self._maybe_log_save_evaluate(
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank40]:     self._save_checkpoint(model, trial)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank40]:     self._save_optimizer_and_scheduler(output_dir)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank40]:     self.model_wrapped.save_checkpoint(output_dir)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank40]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank40]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank40]:     return func(*args, **kwargs)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank40]:     return cdb.barrier(group=group, async_op=async_op)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank40]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank40]:     return func(*args, **kwargs)
[rank40]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank40]:     work = group.barrier(opts=opts)
[rank40]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 40.
[rank32]:[E217 15:05:48.022969103 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 32] Process group watchdog thread terminated with exception: [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600064 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2d44393446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f2cf962a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f2cf9631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2cf963361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f2d444ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f2d48c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f2d48d26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 32] Process group watchdog thread terminated with exception: [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600064 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2d44393446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f2cf962a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f2cf9631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2cf963361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f2d444ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f2d48c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f2d48d26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2d44393446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f2cf92a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f2d444ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f2d48c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f2d48d26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank72]: Traceback (most recent call last):
[rank72]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank72]:     train(attn_implementation="flash_attention_2")
[rank72]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank72]:     trainer.train()
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank72]:     return inner_training_loop(
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank72]:     self._maybe_log_save_evaluate(
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank72]:     self._save_checkpoint(model, trial)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank72]:     self._save_optimizer_and_scheduler(output_dir)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank72]:     self.model_wrapped.save_checkpoint(output_dir)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank72]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank72]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank72]:     return func(*args, **kwargs)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank72]:     return cdb.barrier(group=group, async_op=async_op)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank72]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank72]:     return func(*args, **kwargs)
[rank72]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank72]:     work = group.barrier(opts=opts)
[rank72]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 72.
[rank40]:[E217 15:05:48.118995402 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 40] Process group watchdog thread terminated with exception: [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71d9d50c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x71d98a42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71d98a431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71d98a43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x71d9d521c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x71d9d9a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x71d9d9b26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 40] Process group watchdog thread terminated with exception: [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71d9d50c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x71d98a42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71d98a431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71d98a43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x71d9d521c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x71d9d9a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x71d9d9b26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71d9d50c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x71d98a0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x71d9d521c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x71d9d9a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x71d9d9b26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank72]:[E217 15:05:48.048195288 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 72] Process group watchdog thread terminated with exception: [Rank 72] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75cc2092a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75cbd5c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75cbd5c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75cbd5c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75cc210735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75cc25294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75cc25326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 72] Process group watchdog thread terminated with exception: [Rank 72] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75cc2092a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75cbd5c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75cbd5c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75cbd5c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75cc210735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75cc25294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75cc25326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75cc2092a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75cbd58a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x75cc210735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75cc25294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75cc25326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank120]:[E217 15:05:48.864758082 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 120] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank120]:[E217 15:05:48.864777813 ProcessGroupNCCL.cpp:630] [Rank 120] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank120]:[E217 15:05:48.864783113 ProcessGroupNCCL.cpp:636] [Rank 120] To avoid data inconsistency, we are taking the entire process down. [rank120]:[E217 15:05:48.002628528 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 120] Process group watchdog thread terminated with exception: [Rank 120] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x784cd396c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x784c8902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x784c89031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x784c8903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x784cd4a565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x784cd8694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x784cd8726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 120] Process group watchdog thread terminated with exception: [Rank 120] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x784cd396c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x784c8902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x784c89031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x784c8903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x784cd4a565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x784cd8694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x784cd8726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x784cd396c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x784c88ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x784cd4a565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x784cd8694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x784cd8726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank56]:[E217 15:05:48.930671946 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 56] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank56]:[E217 15:05:48.930697597 ProcessGroupNCCL.cpp:630] [Rank 56] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank56]:[E217 15:05:48.930702388 ProcessGroupNCCL.cpp:636] [Rank 56] To avoid data inconsistency, we are taking the entire process down. [rank112]:[E217 15:05:48.245114046 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 112] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank112]:[E217 15:05:48.245141457 ProcessGroupNCCL.cpp:630] [Rank 112] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank112]:[E217 15:05:48.245147417 ProcessGroupNCCL.cpp:636] [Rank 112] To avoid data inconsistency, we are taking the entire process down. [rank80]:[E217 15:05:48.466045132 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 80] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank80]:[E217 15:05:48.466070283 ProcessGroupNCCL.cpp:630] [Rank 80] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank80]:[E217 15:05:48.466076053 ProcessGroupNCCL.cpp:636] [Rank 80] To avoid data inconsistency, we are taking the entire process down. [rank88]:[E217 15:05:48.606635382 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 88] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank88]:[E217 15:05:48.606653724 ProcessGroupNCCL.cpp:630] [Rank 88] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank88]:[E217 15:05:48.606659594 ProcessGroupNCCL.cpp:636] [Rank 88] To avoid data inconsistency, we are taking the entire process down.
[rank64]:[E217 15:05:48.235277801 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 64] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank64]:[E217 15:05:48.235300771 ProcessGroupNCCL.cpp:630] [Rank 64] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank64]:[E217 15:05:48.235305681 ProcessGroupNCCL.cpp:636] [Rank 64] To avoid data inconsistency, we are taking the entire process down.
[rank56]: Traceback (most recent call last):
[rank56]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank56]:     train(attn_implementation="flash_attention_2")
[rank56]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank56]:     trainer.train()
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank56]:     return inner_training_loop(
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank56]:     self._maybe_log_save_evaluate(
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank56]:     self._save_checkpoint(model, trial)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank56]:     self._save_optimizer_and_scheduler(output_dir)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank56]:     self.model_wrapped.save_checkpoint(output_dir)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank56]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank56]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank56]:     return func(*args, **kwargs)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank56]:     return cdb.barrier(group=group, async_op=async_op)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank56]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank56]:     return func(*args, **kwargs)
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank56]:     work = group.barrier(opts=opts)
[rank56]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 56.
[rank16]:[E217 15:05:48.701743049 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 16] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank16]:[E217 15:05:48.701769410 ProcessGroupNCCL.cpp:630] [Rank 16] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank16]:[E217 15:05:48.701776101 ProcessGroupNCCL.cpp:636] [Rank 16] To avoid data inconsistency, we are taking the entire process down. [rank104]:[E217 15:05:48.612340256 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 104] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank104]:[E217 15:05:48.612367157 ProcessGroupNCCL.cpp:630] [Rank 104] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank104]:[E217 15:05:48.612372637 ProcessGroupNCCL.cpp:636] [Rank 104] To avoid data inconsistency, we are taking the entire process down. [rank56]:[E217 15:05:48.119184057 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 56] Process group watchdog thread terminated with exception: [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e216db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f6dd6a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6dd6a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6dd6a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f6e222585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f6e26094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f6e26126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 56] Process group watchdog thread terminated with exception: [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e216db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f6dd6a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6dd6a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6dd6a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f6e222585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f6e26094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f6e26126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e216db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f6dd66a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f6e222585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f6e26094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f6e26126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank96]:[E217 15:05:48.121686028 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 96] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank96]:[E217 15:05:48.121714948 ProcessGroupNCCL.cpp:630] [Rank 96] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank96]:[E217 15:05:48.121721038 ProcessGroupNCCL.cpp:636] [Rank 96] To avoid data inconsistency, we are taking the entire process down. [rank80]:[E217 15:05:48.606676057 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 80] Process group watchdog thread terminated with exception: [Rank 80] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7aa3887b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7aa33da2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7aa33da31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7aa33da3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7aa3892555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7aa38d094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7aa38d126850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 80] Process group watchdog thread terminated with exception: [Rank 80] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7aa3887b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7aa33da2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7aa33da31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7aa33da3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7aa3892555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7aa38d094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7aa38d126850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7aa3887b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7aa33d6a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7aa3892555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7aa38d094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7aa38d126850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank112]: Traceback (most recent call last):
[rank112]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank112]:     train(attn_implementation="flash_attention_2")
[rank112]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank112]:     trainer.train()
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank112]:     return inner_training_loop(
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank112]:     self._maybe_log_save_evaluate(
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank112]:     self._save_checkpoint(model, trial)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank112]:     self._save_optimizer_and_scheduler(output_dir)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank112]:     self.model_wrapped.save_checkpoint(output_dir)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank112]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank112]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank112]:     return func(*args, **kwargs)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank112]:     return cdb.barrier(group=group, async_op=async_op)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank112]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank112]:     return func(*args, **kwargs)
[rank112]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank112]:     work = group.barrier(opts=opts)
[rank112]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 112.
[rank112]:[E217 15:05:48.455478162 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 112] Process group watchdog thread terminated with exception: [Rank 112] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e0bd916c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e0b8e82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e0b8e831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e0b8e83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e0bd9e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e0bdde94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e0bddf26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 112] Process group watchdog thread terminated with exception: [Rank 112] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e0bd916c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e0b8e82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e0b8e831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e0b8e83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e0bd9e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e0bdde94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e0bddf26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e0bd916c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e0b8e4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7e0bd9e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e0bdde94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e0bddf26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank88]:[E217 15:05:48.782304822 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 88] Process group watchdog thread terminated with exception: [Rank 88] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70801f4e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x707fd482a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x707fd4831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x707fd483361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70801fc735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x708023e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x708023f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 88] Process group watchdog thread terminated with exception: [Rank 88] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70801f4e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x707fd482a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x707fd4831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x707fd483361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70801fc735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x708023e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x708023f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70801f4e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x707fd44a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x70801fc735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x708023e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x708023f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank16]:[E217 15:05:48.841509044 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b40464e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7b3ffb82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b3ffb831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b3ffb83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7b4046c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7b404ae94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7b404af26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b40464e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7b3ffb82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b3ffb831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b3ffb83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7b4046c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7b404ae94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7b404af26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b40464e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7b3ffb4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7b4046c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7b404ae94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7b404af26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank64]: Traceback (most recent call last):
[rank64]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank64]:     train(attn_implementation="flash_attention_2")
[rank64]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank64]:     trainer.train()
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank64]:     return inner_training_loop(
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank64]:     self._maybe_log_save_evaluate(
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank64]:     self._save_checkpoint(model, trial)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank64]:     self._save_optimizer_and_scheduler(output_dir)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank64]:     self.model_wrapped.save_checkpoint(output_dir)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank64]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank64]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank64]:     return func(*args, **kwargs)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank64]:     return cdb.barrier(group=group, async_op=async_op)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank64]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank64]:     return func(*args, **kwargs)
[rank64]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank64]:     work = group.barrier(opts=opts)
[rank64]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 64.
[rank64]:[E217 15:05:48.393129452 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 64] Process group watchdog thread terminated with exception: [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b596fee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b592522a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b5925231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b592523361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b5970a585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b5974894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b5974926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 64] Process group watchdog thread terminated with exception: [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b596fee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b592522a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b5925231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b592523361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b5970a585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b5974894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b5974926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b596fee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7b5924ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7b5970a585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7b5974894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7b5974926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank104]: Traceback (most recent call last): [rank104]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank104]: train(attn_implementation="flash_attention_2") [rank104]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank104]: trainer.train() [rank104]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank104]: return inner_training_loop( [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank104]: self._maybe_log_save_evaluate( [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank104]: self._save_checkpoint(model, trial) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank104]: self._save_optimizer_and_scheduler(output_dir) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank104]: self.model_wrapped.save_checkpoint(output_dir) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank104]: self._create_zero_checkpoint_files(save_dir, tag) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank104]: dist.barrier(group=self.optimizer.dp_process_group) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank104]: return func(*args, **kwargs) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank104]: return cdb.barrier(group=group, async_op=async_op) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank104]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank104]: return func(*args, **kwargs) [rank104]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank104]: work = group.barrier(opts=opts) [rank104]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 104. [rank104]:[E217 15:05:48.773793060 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 104] Process group watchdog thread terminated with exception: [Rank 104] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7845dcee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78459222a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x784592231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78459223361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7845dda515c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7845e1894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7845e1926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 104] Process group watchdog thread terminated with exception: [Rank 104] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7845dcee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78459222a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x784592231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78459223361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7845dda515c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7845e1894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7845e1926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7845dcee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x784591ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7845dda515c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7845e1894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7845e1926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank96]: Traceback (most recent call last): [rank96]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank96]: train(attn_implementation="flash_attention_2") [rank96]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank96]: trainer.train() [rank96]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank96]: return inner_training_loop( [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank96]: self._maybe_log_save_evaluate( [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank96]: self._save_checkpoint(model, trial) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank96]: self._save_optimizer_and_scheduler(output_dir) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank96]: self.model_wrapped.save_checkpoint(output_dir) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank96]: self._create_zero_checkpoint_files(save_dir, tag) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank96]: dist.barrier(group=self.optimizer.dp_process_group) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank96]: return func(*args, **kwargs) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank96]: return cdb.barrier(group=group, async_op=async_op) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank96]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank96]: return func(*args, **kwargs) [rank96]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank96]: work = group.barrier(opts=opts) [rank96]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 96. [rank96]:[E217 15:05:49.313744813 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 96] Process group watchdog thread terminated with exception: [Rank 96] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600090 milliseconds before timing out. 
[rank121]:[E217 15:05:49.616541336 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 121] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank121]:[E217 15:05:49.616569268 ProcessGroupNCCL.cpp:630] [Rank 121] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank121]:[E217 15:05:49.616574758 ProcessGroupNCCL.cpp:636] [Rank 121] To avoid data inconsistency, we are taking the entire process down.
[rank121]:[E217 15:05:49.663825838 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 121] Process group watchdog thread terminated with exception: [Rank 121] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank125]:[E217 15:05:49.771462504 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 125] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank125]:[E217 15:05:49.771487295 ProcessGroupNCCL.cpp:630] [Rank 125] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank125]:[E217 15:05:49.771493156 ProcessGroupNCCL.cpp:636] [Rank 125] To avoid data inconsistency, we are taking the entire process down.
[rank127]:[E217 15:05:49.781307445 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 127] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank127]:[E217 15:05:49.781335537 ProcessGroupNCCL.cpp:630] [Rank 127] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank127]:[E217 15:05:49.781340977 ProcessGroupNCCL.cpp:636] [Rank 127] To avoid data inconsistency, we are taking the entire process down.
[rank125]:[E217 15:05:49.790567595 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 125] Process group watchdog thread terminated with exception: [Rank 125] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank123]:[E217 15:05:49.808532966 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 123] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank123]:[E217 15:05:49.808550547 ProcessGroupNCCL.cpp:630] [Rank 123] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank123]:[E217 15:05:49.808555307 ProcessGroupNCCL.cpp:636] [Rank 123] To avoid data inconsistency, we are taking the entire process down.
[rank127]:[E217 15:05:49.829710927 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 127] Process group watchdog thread terminated with exception: [Rank 127] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[rank122]:[E217 15:05:49.846946499 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 122] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank122]:[E217 15:05:49.846970150 ProcessGroupNCCL.cpp:630] [Rank 122] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank122]:[E217 15:05:49.846975440 ProcessGroupNCCL.cpp:636] [Rank 122] To avoid data inconsistency, we are taking the entire process down.
[rank124]:[E217 15:05:49.847958553 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 124] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank124]:[E217 15:05:49.847971023 ProcessGroupNCCL.cpp:630] [Rank 124] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank124]:[E217 15:05:49.847974963 ProcessGroupNCCL.cpp:636] [Rank 124] To avoid data inconsistency, we are taking the entire process down.
[rank126]:[E217 15:05:49.855208236 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 126] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank126]:[E217 15:05:49.855224787 ProcessGroupNCCL.cpp:630] [Rank 126] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank126]:[E217 15:05:49.855229127 ProcessGroupNCCL.cpp:636] [Rank 126] To avoid data inconsistency, we are taking the entire process down.
[rank122]:[E217 15:05:49.865188994 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 122] Process group watchdog thread terminated with exception: [Rank 122] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600031 milliseconds before timing out.
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank123]: train(attn_implementation="flash_attention_2") [rank123]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank123]: trainer.train() [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank123]: return inner_training_loop( [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank123]: self._maybe_log_save_evaluate( [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank123]: self._save_checkpoint(model, trial) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank123]: self._save_optimizer_and_scheduler(output_dir) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank123]: self.model_wrapped.save_checkpoint(output_dir) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank123]: self._create_zero_checkpoint_files(save_dir, tag) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank123]: dist.barrier(group=self.optimizer.dp_process_group) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank123]: return func(*args, **kwargs) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank123]: return cdb.barrier(group=group, async_op=async_op) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank123]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank123]: return func(*args, **kwargs) [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank123]: work = group.barrier(opts=opts) [rank123]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 123. [rank123]:[E217 15:05:49.870975420 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 123] Process group watchdog thread terminated with exception: [Rank 123] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank126]: return inner_training_loop( [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank126]: self._maybe_log_save_evaluate( [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank126]: self._save_checkpoint(model, trial) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank126]: self._save_optimizer_and_scheduler(output_dir) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank126]: self.model_wrapped.save_checkpoint(output_dir) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank126]: self._create_zero_checkpoint_files(save_dir, tag) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank126]: dist.barrier(group=self.optimizer.dp_process_group) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank126]: return func(*args, **kwargs) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank126]: return cdb.barrier(group=group, async_op=async_op) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank126]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank126]: return func(*args, **kwargs) [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank126]: work = group.barrier(opts=opts) [rank126]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 126. 
[rank124]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 124.
[rank126]:[E217 15:05:49.983926478 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 126] Process group watchdog thread terminated with exception: [Rank 126] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank124]:[E217 15:05:49.987759051 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 124] Process group watchdog thread terminated with exception: [Rank 124] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600072 milliseconds before timing out.
[rank54]:[E217 15:05:50.939275523 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 54] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank54]:[E217 15:05:50.939301004 ProcessGroupNCCL.cpp:630] [Rank 54] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank54]:[E217 15:05:50.939306064 ProcessGroupNCCL.cpp:636] [Rank 54] To avoid data inconsistency, we are taking the entire process down.
[rank50]:[E217 15:05:50.940455878 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 50] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank50]:[E217 15:05:50.940485299 ProcessGroupNCCL.cpp:630] [Rank 50] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank50]:[E217 15:05:50.940490869 ProcessGroupNCCL.cpp:636] [Rank 50] To avoid data inconsistency, we are taking the entire process down.
[rank54]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 54.
[rank50]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 50.
[rank50]:[E217 15:05:50.007388741 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 50] Process group watchdog thread terminated with exception: [Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600035 milliseconds before timing out.
[rank54]:[E217 15:05:50.014136695 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 54] Process group watchdog thread terminated with exception: [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
[rank45]:[E217 15:05:50.350120678 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 45] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank45]:[E217 15:05:50.350152039 ProcessGroupNCCL.cpp:630] [Rank 45] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank45]:[E217 15:05:50.350157860 ProcessGroupNCCL.cpp:636] [Rank 45] To avoid data inconsistency, we are taking the entire process down.
[rank45]:[E217 15:05:50.352133030 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 45] Process group watchdog thread terminated with exception: [Rank 45] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75838416c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75833982a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x758339831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75833983361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7583852565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x758388e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x758388f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75838416c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7583394a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7583852565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x758388e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x758388f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank11]:[E217 15:05:50.354109204 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 11] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank11]:[E217 15:05:50.354139124 ProcessGroupNCCL.cpp:630] [Rank 11] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank11]:[E217 15:05:50.354144375 ProcessGroupNCCL.cpp:636] [Rank 11] To avoid data inconsistency, we are taking the entire process down. 
[rank11]: Traceback (most recent call last):
[rank11]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank11]:     train(attn_implementation="flash_attention_2")
[rank11]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank11]:     trainer.train()
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank11]:     return inner_training_loop(
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank11]:     self._maybe_log_save_evaluate(
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank11]:     self._save_checkpoint(model, trial)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank11]:     self._save_optimizer_and_scheduler(output_dir)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank11]:     self.model_wrapped.save_checkpoint(output_dir)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank11]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank11]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank11]:     return func(*args, **kwargs)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank11]:     return cdb.barrier(group=group, async_op=async_op)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank11]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank11]:     return func(*args, **kwargs)
[rank11]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank11]:     work = group.barrier(opts=opts)
[rank11]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 11.
[rank11]:[E217 15:05:50.416926478 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 11] Process group watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
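The traceback above shows the rank blocked in the dist.barrier() that DeepSpeed issues from _create_zero_checkpoint_files while the checkpoint is being written, and the watchdog records show the pending collective (logged as ALLREDUCE, SeqNum=66413) being killed at the 600000 ms (10 minute) limit. For reference, a minimal sketch of raising that limit is given below; this is a common mitigation when checkpoint I/O outlasts the collective timeout, not something taken from this log, and the two-hour value, the output_dir, and the choice between the two options are illustrative assumptions.

    from datetime import timedelta

    import torch.distributed as dist
    from transformers import TrainingArguments

    # Option A: when the process group is created by hand (torchrun provides
    # RANK/WORLD_SIZE/MASTER_ADDR), pass a timeout larger than the 10 min
    # reported by the watchdog above.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))

    # Option B: when the Hugging Face Trainer sets up distributed state,
    # TrainingArguments exposes the same knob in seconds.
    training_args = TrainingArguments(
        output_dir="./checkpoints",  # placeholder path, not this run's real output_dir
        ddp_timeout=7200,            # 2 h; illustrative, not a value from this run
    )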
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7764ed36c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7764a2a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7764a2a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7764a2a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7764ee4565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7764f2094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7764f2126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 11] Process group watchdog thread terminated with exception: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600065 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7764ed36c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7764a2a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7764a2a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7764a2a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7764ee4565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7764f2094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7764f2126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7764ed36c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7764a26a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7764ee4565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7764f2094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7764f2126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank49]:[E217 15:05:50.162111452 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 49] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank49]:[E217 15:05:50.162134632 ProcessGroupNCCL.cpp:630] [Rank 49] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank49]:[E217 15:05:50.162139572 ProcessGroupNCCL.cpp:636] [Rank 49] To avoid data inconsistency, we are taking the entire process down. [rank29]:[E217 15:05:50.270130809 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 29] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank29]:[E217 15:05:50.270161051 ProcessGroupNCCL.cpp:630] [Rank 29] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank29]:[E217 15:05:50.270166171 ProcessGroupNCCL.cpp:636] [Rank 29] To avoid data inconsistency, we are taking the entire process down. [rank51]:[E217 15:05:50.167679351 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 51] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank51]:[E217 15:05:50.167705771 ProcessGroupNCCL.cpp:630] [Rank 51] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank51]:[E217 15:05:50.167711071 ProcessGroupNCCL.cpp:636] [Rank 51] To avoid data inconsistency, we are taking the entire process down. [rank53]:[E217 15:05:50.171777069 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 53] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank53]:[E217 15:05:50.171792359 ProcessGroupNCCL.cpp:630] [Rank 53] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank53]:[E217 15:05:50.171798889 ProcessGroupNCCL.cpp:636] [Rank 53] To avoid data inconsistency, we are taking the entire process down. [rank55]:[E217 15:05:50.174823244 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 55] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank55]:[E217 15:05:50.174849124 ProcessGroupNCCL.cpp:630] [Rank 55] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank55]:[E217 15:05:50.174854454 ProcessGroupNCCL.cpp:636] [Rank 55] To avoid data inconsistency, we are taking the entire process down. [rank47]:[E217 15:05:50.536505122 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 47] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank47]:[E217 15:05:50.536530633 ProcessGroupNCCL.cpp:630] [Rank 47] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank47]:[E217 15:05:50.536535824 ProcessGroupNCCL.cpp:636] [Rank 47] To avoid data inconsistency, we are taking the entire process down. [rank29]:[E217 15:05:50.318020759 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 29] Process group watchdog thread terminated with exception: [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71f8fa56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71f8afc2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71f8afc31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71f8afc3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71f8fa9635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x71f8ff094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x71f8ff126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 29] Process group watchdog thread terminated with exception: [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71f8fa56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71f8afc2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71f8afc31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71f8afc3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71f8fa9635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x71f8ff094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x71f8ff126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71f8fa56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x71f8af8a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x71f8fa9635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x71f8ff094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x71f8ff126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank49]: Traceback (most recent call last): [rank49]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank49]: train(attn_implementation="flash_attention_2") [rank49]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank49]: trainer.train() [rank49]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank49]: return inner_training_loop( [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank49]: self._maybe_log_save_evaluate( [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank49]: self._save_checkpoint(model, trial) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank49]: self._save_optimizer_and_scheduler(output_dir) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank49]: self.model_wrapped.save_checkpoint(output_dir) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank49]: self._create_zero_checkpoint_files(save_dir, tag) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank49]: dist.barrier(group=self.optimizer.dp_process_group) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank49]: return func(*args, **kwargs) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank49]: return cdb.barrier(group=group, async_op=async_op) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank49]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank49]: return func(*args, **kwargs) [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank49]: work = group.barrier(opts=opts) [rank49]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 49. [rank43]:[E217 15:05:50.547030661 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 43] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank43]:[E217 15:05:50.547061303 ProcessGroupNCCL.cpp:630] [Rank 43] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank43]:[E217 15:05:50.547070534 ProcessGroupNCCL.cpp:636] [Rank 43] To avoid data inconsistency, we are taking the entire process down. [rank53]:[E217 15:05:50.219256025 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 53] Process group watchdog thread terminated with exception: [Rank 53] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff0eabb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ff09fe2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff09fe31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff09fe3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ff0eb66c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ff0ef494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ff0ef526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 53] Process group watchdog thread terminated with exception: [Rank 53] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff0eabb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ff09fe2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff09fe31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff09fe3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ff0eb66c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ff0ef494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ff0ef526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff0eabb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7ff09faa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7ff0eb66c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7ff0ef494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7ff0ef526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank51]: Traceback (most recent call last): [rank51]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank51]: train(attn_implementation="flash_attention_2") [rank51]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank51]: trainer.train() [rank51]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank51]: return inner_training_loop( [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank51]: self._maybe_log_save_evaluate( [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank51]: self._save_checkpoint(model, trial) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank51]: self._save_optimizer_and_scheduler(output_dir) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank51]: self.model_wrapped.save_checkpoint(output_dir) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank51]: self._create_zero_checkpoint_files(save_dir, tag) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank51]: dist.barrier(group=self.optimizer.dp_process_group) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank51]: return func(*args, **kwargs) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank51]: return cdb.barrier(group=group, async_op=async_op) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank51]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank51]: return func(*args, **kwargs) [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank51]: work = group.barrier(opts=opts) [rank51]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 51. [rank55]:[E217 15:05:50.223267890 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 55] Process group watchdog thread terminated with exception: [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75f245d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75f1fb42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75f1fb431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75f1fb43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75f2461635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75f24a894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75f24a926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 55] Process group watchdog thread terminated with exception: [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75f245d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75f1fb42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75f1fb431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75f1fb43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75f2461635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75f24a894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75f24a926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75f245d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75f1fb0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x75f2461635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75f24a894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75f24a926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank49]:[E217 15:05:50.228990103 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 49] Process group watchdog thread terminated with exception: [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x728d03993446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x728cb8c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x728cb8c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x728cb8c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x728d03aee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x728d08294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x728d08326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 49] Process group watchdog thread terminated with exception: [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x728d03993446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x728cb8c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x728cb8c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x728cb8c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x728d03aee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x728d08294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x728d08326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x728d03993446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x728cb88a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x728d03aee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x728d08294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x728d08326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank13]:[E217 15:05:50.507117162 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 13] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank13]:[E217 15:05:50.507140423 ProcessGroupNCCL.cpp:630] [Rank 13] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank13]:[E217 15:05:50.507145923 ProcessGroupNCCL.cpp:636] [Rank 13] To avoid data inconsistency, we are taking the entire process down. [rank51]:[E217 15:05:50.245178979 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 51] Process group watchdog thread terminated with exception: [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a467fdb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a463502a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a4635031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a463503361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a46808665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a4684694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a4684726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 51] Process group watchdog thread terminated with exception: [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600029 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a467fdb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a463502a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a4635031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a463503361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a46808665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a4684694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a4684726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a467fdb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a4634ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7a46808665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a4684694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a4684726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank47]:[E217 15:05:50.585317595 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 47] Process group watchdog thread terminated with exception: [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600089 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d572ad93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d56e002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d56e0031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d56e003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d572aeee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d572f694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d572f726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 47] Process group watchdog thread terminated with exception: [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600089 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d572ad93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d56e002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d56e0031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d56e003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d572aeee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d572f694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d572f726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d572ad93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7d56dfca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7d572aeee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7d572f694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7d572f726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank77]:[E217 15:05:50.511493526 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 77] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank77]:[E217 15:05:50.511522407 ProcessGroupNCCL.cpp:630] [Rank 77] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank77]:[E217 15:05:50.511527428 ProcessGroupNCCL.cpp:636] [Rank 77] To avoid data inconsistency, we are taking the entire process down. [rank52]:[E217 15:05:50.265712229 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 52] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank52]:[E217 15:05:50.265738389 ProcessGroupNCCL.cpp:630] [Rank 52] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank52]:[E217 15:05:50.265744460 ProcessGroupNCCL.cpp:636] [Rank 52] To avoid data inconsistency, we are taking the entire process down. 
[rank43]: Traceback (most recent call last): [rank43]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank43]: train(attn_implementation="flash_attention_2") [rank43]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank43]: trainer.train() [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank43]: return inner_training_loop( [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank43]: self._maybe_log_save_evaluate( [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank43]: self._save_checkpoint(model, trial) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank43]: self._save_optimizer_and_scheduler(output_dir) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank43]: self.model_wrapped.save_checkpoint(output_dir) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank43]: self._create_zero_checkpoint_files(save_dir, tag) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank43]: dist.barrier(group=self.optimizer.dp_process_group) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank43]: return func(*args, **kwargs) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank43]: return cdb.barrier(group=group, async_op=async_op) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank43]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank43]: return func(*args, **kwargs) [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank43]: work = group.barrier(opts=opts) [rank43]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 43. [rank36]:[E217 15:05:50.555780320 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 36] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank36]:[E217 15:05:50.555808982 ProcessGroupNCCL.cpp:630] [Rank 36] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank36]:[E217 15:05:50.555813732 ProcessGroupNCCL.cpp:636] [Rank 36] To avoid data inconsistency, we are taking the entire process down. [rank15]:[E217 15:05:50.536058068 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 15] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank15]:[E217 15:05:50.536086708 ProcessGroupNCCL.cpp:630] [Rank 15] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank15]:[E217 15:05:50.536092248 ProcessGroupNCCL.cpp:636] [Rank 15] To avoid data inconsistency, we are taking the entire process down. [rank42]:[E217 15:05:50.603828111 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 42] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank42]:[E217 15:05:50.603850762 ProcessGroupNCCL.cpp:630] [Rank 42] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank42]:[E217 15:05:50.603855783 ProcessGroupNCCL.cpp:636] [Rank 42] To avoid data inconsistency, we are taking the entire process down. [rank15]:[E217 15:05:50.537982222 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79df81376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79df3662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x79df36631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79df3663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x79df81e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x79df85c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x79df85d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 15] Process group watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79df81376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79df3662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x79df36631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79df3663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x79df81e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x79df85c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x79df85d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79df81376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x79df362a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x79df81e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x79df85c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x79df85d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank44]:[E217 15:05:50.608104020 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 44] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank44]:[E217 15:05:50.608127002 ProcessGroupNCCL.cpp:630] [Rank 44] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank44]:[E217 15:05:50.608131812 ProcessGroupNCCL.cpp:636] [Rank 44] To avoid data inconsistency, we are taking the entire process down. [rank116]:[E217 15:05:50.467713374 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 116] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank116]:[E217 15:05:50.467743826 ProcessGroupNCCL.cpp:630] [Rank 116] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank116]:[E217 15:05:50.467749446 ProcessGroupNCCL.cpp:636] [Rank 116] To avoid data inconsistency, we are taking the entire process down. [rank41]:[E217 15:05:50.608249319 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 41] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank41]:[E217 15:05:50.608265409 ProcessGroupNCCL.cpp:630] [Rank 41] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank41]:[E217 15:05:50.608269570 ProcessGroupNCCL.cpp:636] [Rank 41] To avoid data inconsistency, we are taking the entire process down. [rank10]:[E217 15:05:50.541953533 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 10] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. 
[rank10]:[E217 15:05:50.541975684 ProcessGroupNCCL.cpp:630] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank10]:[E217 15:05:50.541980664 ProcessGroupNCCL.cpp:636] [Rank 10] To avoid data inconsistency, we are taking the entire process down. [rank46]:[E217 15:05:50.616151641 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 46] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank46]:[E217 15:05:50.616166072 ProcessGroupNCCL.cpp:630] [Rank 46] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank46]:[E217 15:05:50.616172482 ProcessGroupNCCL.cpp:636] [Rank 46] To avoid data inconsistency, we are taking the entire process down. [rank13]: Traceback (most recent call last): [rank13]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank13]: train(attn_implementation="flash_attention_2") [rank13]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank13]: trainer.train() [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank13]: return inner_training_loop( [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank13]: self._maybe_log_save_evaluate( [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank13]: self._save_checkpoint(model, trial) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank13]: self._save_optimizer_and_scheduler(output_dir) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank13]: self.model_wrapped.save_checkpoint(output_dir) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank13]: self._create_zero_checkpoint_files(save_dir, tag) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank13]: dist.barrier(group=self.optimizer.dp_process_group) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank13]: return func(*args, **kwargs) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank13]: return cdb.barrier(group=group, async_op=async_op) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank13]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank13]: return func(*args, **kwargs) [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank13]: work = group.barrier(opts=opts) [rank13]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 13. 
[rank38]:[E217 15:05:50.584538853 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 38] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank38]:[E217 15:05:50.584568604 ProcessGroupNCCL.cpp:630] [Rank 38] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank38]:[E217 15:05:50.584573835 ProcessGroupNCCL.cpp:636] [Rank 38] To avoid data inconsistency, we are taking the entire process down. [rank13]:[E217 15:05:50.570693914 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cb05016c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cb00582a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cb005831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cb00583361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cb0505e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cb054e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cb054f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 13] Process group watchdog thread terminated with exception: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cb05016c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cb00582a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cb005831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cb00583361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cb0505e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cb054e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cb054f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cb05016c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7cb0054a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7cb0505e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7cb054e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7cb054f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank74]:[E217 15:05:50.559858902 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 74] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank74]:[E217 15:05:50.559883203 ProcessGroupNCCL.cpp:630] [Rank 74] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank74]:[E217 15:05:50.559888753 ProcessGroupNCCL.cpp:636] [Rank 74] To avoid data inconsistency, we are taking the entire process down. [rank75]:[E217 15:05:50.559880213 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 75] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank75]:[E217 15:05:50.559900974 ProcessGroupNCCL.cpp:630] [Rank 75] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank75]:[E217 15:05:50.559907625 ProcessGroupNCCL.cpp:636] [Rank 75] To avoid data inconsistency, we are taking the entire process down. [rank52]:[E217 15:05:50.312042130 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 52] Process group watchdog thread terminated with exception: [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d8302576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d82b782a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d82b7831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d82b783361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d8302c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d8306e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d8306f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 52] Process group watchdog thread terminated with exception: [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600037 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d8302576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d82b782a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d82b7831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d82b783361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d8302c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d8306e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d8306f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d8302576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7d82b74a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7d8302c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7d8306e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7d8306f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank27]:[E217 15:05:50.421170456 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 27] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank27]:[E217 15:05:50.421200528 ProcessGroupNCCL.cpp:630] [Rank 27] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank27]:[E217 15:05:50.421206088 ProcessGroupNCCL.cpp:636] [Rank 27] To avoid data inconsistency, we are taking the entire process down.
[rank36]:[E217 15:05:50.600186696 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 36] Process group watchdog thread terminated with exception: [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7665e0ae7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x766595e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x766595e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x766595e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7665e12735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7665e5494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7665e5526850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 36] Process group watchdog thread terminated with exception: [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7665e0ae7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x766595e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x766595e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x766595e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7665e12735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7665e5494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7665e5526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7665e0ae7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x766595aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7665e12735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7665e5494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7665e5526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank77]: Traceback (most recent call last): [rank77]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank77]: train(attn_implementation="flash_attention_2") [rank77]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank77]: trainer.train() [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank77]: return inner_training_loop( [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank77]: self._maybe_log_save_evaluate( [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank77]: self._save_checkpoint(model, trial) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank77]: self._save_optimizer_and_scheduler(output_dir) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank77]: self.model_wrapped.save_checkpoint(output_dir) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank77]: self._create_zero_checkpoint_files(save_dir, tag) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank77]: dist.barrier(group=self.optimizer.dp_process_group) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank77]: return func(*args, **kwargs) [rank77]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank77]: return cdb.barrier(group=group, async_op=async_op) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank77]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank77]: return func(*args, **kwargs) [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank77]: work = group.barrier(opts=opts) [rank77]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 77. [rank31]:[E217 15:05:50.423836360 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 31] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank31]:[E217 15:05:50.423863411 ProcessGroupNCCL.cpp:630] [Rank 31] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank31]:[E217 15:05:50.423868621 ProcessGroupNCCL.cpp:636] [Rank 31] To avoid data inconsistency, we are taking the entire process down. [rank25]:[E217 15:05:50.425285537 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 25] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank25]:[E217 15:05:50.425309798 ProcessGroupNCCL.cpp:630] [Rank 25] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank25]:[E217 15:05:50.425316088 ProcessGroupNCCL.cpp:636] [Rank 25] To avoid data inconsistency, we are taking the entire process down. [rank9]:[E217 15:05:50.584546423 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 9] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank9]:[E217 15:05:50.584569643 ProcessGroupNCCL.cpp:630] [Rank 9] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank9]:[E217 15:05:50.584574473 ProcessGroupNCCL.cpp:636] [Rank 9] To avoid data inconsistency, we are taking the entire process down. 
[rank44]: Traceback (most recent call last):
[rank44]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank44]: train(attn_implementation="flash_attention_2")
[rank44]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank44]: trainer.train()
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank44]: return inner_training_loop(
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank44]: self._maybe_log_save_evaluate(
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank44]: self._save_checkpoint(model, trial)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank44]: self._save_optimizer_and_scheduler(output_dir)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank44]: self.model_wrapped.save_checkpoint(output_dir)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank44]: self._create_zero_checkpoint_files(save_dir, tag)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank44]: dist.barrier(group=self.optimizer.dp_process_group)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank44]: return func(*args, **kwargs)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank44]: return cdb.barrier(group=group, async_op=async_op)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank44]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank44]: return func(*args, **kwargs)
[rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank44]: work = group.barrier(opts=opts)
[rank44]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 44.
[rank41]: Traceback (most recent call last):
[rank41]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank41]: train(attn_implementation="flash_attention_2")
[rank41]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank41]: trainer.train()
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank41]: return inner_training_loop(
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank41]: self._maybe_log_save_evaluate(
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank41]: self._save_checkpoint(model, trial)
[rank10]: Traceback (most recent call last):
[rank10]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank10]: train(attn_implementation="flash_attention_2")
[rank10]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank10]: trainer.train()
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank10]: return inner_training_loop(
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank10]: self._maybe_log_save_evaluate(
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank10]: self._save_checkpoint(model, trial)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank41]: self._save_optimizer_and_scheduler(output_dir)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank41]: self.model_wrapped.save_checkpoint(output_dir)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank41]: self._create_zero_checkpoint_files(save_dir, tag)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank41]: dist.barrier(group=self.optimizer.dp_process_group)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank41]: return func(*args, **kwargs)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank10]: self._save_optimizer_and_scheduler(output_dir)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank10]: self.model_wrapped.save_checkpoint(output_dir)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank10]: self._create_zero_checkpoint_files(save_dir, tag)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank10]: dist.barrier(group=self.optimizer.dp_process_group)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank10]: return func(*args, **kwargs)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank41]: return cdb.barrier(group=group, async_op=async_op)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank41]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank41]: return func(*args, **kwargs)
[rank41]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank41]: work = group.barrier(opts=opts)
[rank41]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 41.
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank10]: return cdb.barrier(group=group, async_op=async_op)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank10]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank10]: return func(*args, **kwargs)
[rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank10]: work = group.barrier(opts=opts)
[rank10]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 10.
[rank12]:[E217 15:05:50.590041769 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 12] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank12]:[E217 15:05:50.590061910 ProcessGroupNCCL.cpp:630] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank12]:[E217 15:05:50.590066550 ProcessGroupNCCL.cpp:636] [Rank 12] To avoid data inconsistency, we are taking the entire process down.
[rank77]:[E217 15:05:50.576300892 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 77] Process group watchdog thread terminated with exception: [Rank 77] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2a4c6b5446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f2a01a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f2a01a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2a01a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f2a4c81c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f2a51094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f2a51126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 77] Process group watchdog thread terminated with exception: [Rank 77] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2a4c6b5446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f2a01a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f2a01a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2a01a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f2a4c81c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f2a51094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f2a51126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2a4c6b5446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f2a016a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f2a4c81c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f2a51094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f2a51126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank105]:[E217 15:05:50.751659399 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 105] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank105]:[E217 15:05:50.751686379 ProcessGroupNCCL.cpp:630] [Rank 105] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank105]:[E217 15:05:50.751691369 ProcessGroupNCCL.cpp:636] [Rank 105] To avoid data inconsistency, we are taking the entire process down. [rank14]:[E217 15:05:50.591725408 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 14] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank14]:[E217 15:05:50.591746308 ProcessGroupNCCL.cpp:630] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank14]:[E217 15:05:50.591750619 ProcessGroupNCCL.cpp:636] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [rank41]:[E217 15:05:50.658260528 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 41] Process group watchdog thread terminated with exception: [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600082 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75bfbc350446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75bf7162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75bf71631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75bf7163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75bfbc4ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75bfc0c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75bfc0d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank42]: Traceback (most recent call last): [rank42]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank42]: train(attn_implementation="flash_attention_2") [rank42]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank42]: trainer.train() [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank42]: return inner_training_loop( [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank42]: self._maybe_log_save_evaluate( [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank42]: self._save_checkpoint(model, trial) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank42]: self._save_optimizer_and_scheduler(output_dir) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank42]: self.model_wrapped.save_checkpoint(output_dir) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank42]: 
self._create_zero_checkpoint_files(save_dir, tag) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank42]: dist.barrier(group=self.optimizer.dp_process_group) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank42]: return func(*args, **kwargs) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank42]: return cdb.barrier(group=group, async_op=async_op) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank42]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank42]: return func(*args, **kwargs) [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank42]: work = group.barrier(opts=opts) [rank42]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 42. what(): [PG ID 1 PG GUID 1 Rank 41] Process group watchdog thread terminated with exception: [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600082 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75bfbc350446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75bf7162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75bf71631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75bf7163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75bfbc4ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75bfc0c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75bfc0d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75bfbc350446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75bf712a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x75bfbc4ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75bfc0c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75bfc0d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank34]:[E217 15:05:50.613489285 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 34] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank34]:[E217 15:05:50.613506766 ProcessGroupNCCL.cpp:630] [Rank 34] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank34]:[E217 15:05:50.613511687 ProcessGroupNCCL.cpp:636] [Rank 34] To avoid data inconsistency, we are taking the entire process down.
[rank116]: Traceback (most recent call last):
[rank116]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank116]: train(attn_implementation="flash_attention_2")
[rank116]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank116]: trainer.train()
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank116]: return inner_training_loop(
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank116]: self._maybe_log_save_evaluate(
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank116]: self._save_checkpoint(model, trial)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank116]: self._save_optimizer_and_scheduler(output_dir)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank116]: self.model_wrapped.save_checkpoint(output_dir)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank116]: self._create_zero_checkpoint_files(save_dir, tag)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank116]: dist.barrier(group=self.optimizer.dp_process_group)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank116]: return func(*args, **kwargs)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank116]: return cdb.barrier(group=group, async_op=async_op)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank116]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank116]: return func(*args, **kwargs)
[rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank116]: work = group.barrier(opts=opts)
[rank116]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 116.
[rank46]: Traceback (most recent call last):
[rank46]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank46]: train(attn_implementation="flash_attention_2")
[rank46]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank46]: trainer.train()
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank46]: return inner_training_loop(
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank46]: self._maybe_log_save_evaluate(
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank46]: self._save_checkpoint(model, trial)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank46]: self._save_optimizer_and_scheduler(output_dir)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank46]: self.model_wrapped.save_checkpoint(output_dir)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank46]: self._create_zero_checkpoint_files(save_dir, tag)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank46]: dist.barrier(group=self.optimizer.dp_process_group)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank46]: return func(*args, **kwargs)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank46]: return cdb.barrier(group=group, async_op=async_op)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank46]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank46]: return func(*args, **kwargs)
[rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank46]: work = group.barrier(opts=opts)
[rank46]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 46.
[rank42]:[E217 15:05:50.669161569 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 42] Process group watchdog thread terminated with exception: [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e9ebbb6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e9e7122a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e9e71231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e9e7123361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e9ebc85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e9ec0894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e9ec0926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 42] Process group watchdog thread terminated with exception: [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e9ebbb6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e9e7122a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e9e71231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank9]:[E217 15:05:50.603028568 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77b675d2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77b62b02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77b62b031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e9e7123361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7e9ebc85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7e9ec0894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7e9ec0926850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e9ebbb6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7e9e70ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77b62b03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x77b6764735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x77b67a694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x77b67a726850 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: + 0x145c0 (0x7e9ebc85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7e9ec0894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7e9ec0926850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600064 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77b675d2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77b62b02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77b62b031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77b62b03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77b6764735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x77b67a694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x77b67a726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77b675d2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x77b62aca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x77b6764735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x77b67a694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x77b67a726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank74]:[E217 15:05:50.596256581 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 74] Process group watchdog thread terminated with exception: [Rank 74] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c63ce7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f1c1902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f1c19031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f1c1903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f1c644735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f1c68694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f1c68726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 74] Process group watchdog thread terminated with exception: [Rank 74] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c63ce7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f1c1902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f1c19031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f1c1903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f1c644735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f1c68694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f1c68726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c63ce7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f1c18ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f1c644735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f1c68694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f1c68726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank46]:[E217 15:05:50.677628613 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 46] Process group watchdog thread terminated with exception: [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75467856c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75462dc2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75462dc31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75462dc3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7546789e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75467d294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75467d326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 46] Process group watchdog thread terminated with exception: [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75467856c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75462dc2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75462dc31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75462dc3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7546789e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75467d294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75467d326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75467856c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75462d8a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7546789e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75467d294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75467d326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank10]:[E217 15:05:50.614104772 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70209416c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70204982a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702049831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70204983361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x702094e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x702098e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x702098f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70209416c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70204982a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702049831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70204983361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x702094e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x702098e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x702098f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70209416c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7020494a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank43]:[E217 15:05:50.681119608 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 43] Process group watchdog thread terminated with exception: [Rank 43] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x720c30f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x720be662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x720be6631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x702094e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x702098e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x702098f26850 in /lib/x86_64-linux-gnu/libc.so.6) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x720be663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x720c31c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x720c35a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x720c35b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 43] Process group watchdog thread terminated with exception: [Rank 43] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x720c30f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x720be662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x720be6631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x720be663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x720c31c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x720c35a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x720c35b26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x720c30f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x720be62a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x720c31c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x720c35a94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x720c35b26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank75]: Traceback (most recent call last):
[rank75]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank75]:     train(attn_implementation="flash_attention_2")
[rank75]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank75]:     trainer.train()
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank75]:     return inner_training_loop(
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank75]:     self._maybe_log_save_evaluate(
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank75]:     self._save_checkpoint(model, trial)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank75]:     self._save_optimizer_and_scheduler(output_dir)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank75]:     self.model_wrapped.save_checkpoint(output_dir)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank75]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank75]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank75]:     return func(*args, **kwargs)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank75]:     return cdb.barrier(group=group, async_op=async_op)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank75]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank75]:     return func(*args, **kwargs)
[rank75]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank75]:     work = group.barrier(opts=opts)
[rank75]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 75.
[rank38]: Traceback (most recent call last):
[rank38]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank38]:     train(attn_implementation="flash_attention_2")
[rank38]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank38]:     trainer.train()
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank38]:     return inner_training_loop(
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank38]:     self._maybe_log_save_evaluate(
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank38]:     self._save_checkpoint(model, trial)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank38]:     self._save_optimizer_and_scheduler(output_dir)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank38]:     self.model_wrapped.save_checkpoint(output_dir)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank38]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank38]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank38]:     return func(*args, **kwargs)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank38]:     return cdb.barrier(group=group, async_op=async_op)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank38]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank38]:     return func(*args, **kwargs)
[rank38]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank38]:     work = group.barrier(opts=opts)
[rank38]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 38.
[rank75]:[E217 15:05:51.614654767 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 75] Process group watchdog thread terminated with exception: [Rank 75] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e629af76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e625022a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e6250231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e625023361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e629ba555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e629f894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e629f926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 75] Process group watchdog thread terminated with exception: [Rank 75] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e629af76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e625022a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e6250231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e625023361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e629ba555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e629f894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e629f926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e629af76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e624fea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank59]:[E217 15:05:51.294302454 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 59] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank59]:[E217 15:05:51.294328485 ProcessGroupNCCL.cpp:630] [Rank 59] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank59]:[E217 15:05:51.294336105 ProcessGroupNCCL.cpp:636] [Rank 59] To avoid data inconsistency, we are taking the entire process down. frame #2: + 0x145c0 (0x7e629ba555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e629f894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e629f926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank31]:[E217 15:05:51.470930399 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b1b61776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b1b16a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b1b16a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b1b16a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b1b61e585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b1b66094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b1b66126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b1b61776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b1b16a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b1b16a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b1b16a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b1b61e585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b1b66094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b1b66126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b1b61776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7b1b166a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7b1b61e585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7b1b66094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7b1b66126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank12]:[E217 15:05:51.633715173 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73c1f0776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73c1a5a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73c1a5a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73c1a5a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73c1f0e585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73c1f5094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73c1f5126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank25]:[E217 15:05:51.475274719 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 25] Process group watchdog thread terminated with exception: [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f895b76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78f84ae2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78f84ae31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78f84ae3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78f8962585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78f89a494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78f89a526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600067 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73c1f0776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73c1a5a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73c1a5a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73c1a5a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73c1f0e585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73c1f5094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73c1f5126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73c1f0776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73c1a56a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x73c1f0e585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73c1f5094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73c1f5126850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 25] Process group watchdog thread terminated with exception: [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f895b76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78f84ae2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78f84ae31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78f84ae3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78f8962585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78f89a494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78f89a526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f895b76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x78f84aaa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x78f8962585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x78f89a494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x78f89a526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank105]:[E217 15:05:51.799228265 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 105] Process group watchdog thread terminated with exception: [Rank 105] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cf83796c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cf7ed02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cf7ed031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cf7ed03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cf838a5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cf83c694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cf83c726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 105] Process group watchdog thread terminated with exception: [Rank 105] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cf83796c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cf7ed02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cf7ed031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cf7ed03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cf838a5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cf83c694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cf83c726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cf83796c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7cf7ecca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7cf838a5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7cf83c694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7cf83c726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank14]: Traceback (most recent call last): [rank14]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank14]: train(attn_implementation="flash_attention_2") [rank14]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank14]: trainer.train() [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank14]: return inner_training_loop( [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank14]: self._maybe_log_save_evaluate( [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank14]: self._save_checkpoint(model, trial) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank14]: self._save_optimizer_and_scheduler(output_dir) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank14]: self.model_wrapped.save_checkpoint(output_dir) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank14]: self._create_zero_checkpoint_files(save_dir, tag) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank14]: dist.barrier(group=self.optimizer.dp_process_group) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank14]: return func(*args, **kwargs) [rank14]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank14]: return cdb.barrier(group=group, async_op=async_op) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank14]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank14]: return func(*args, **kwargs) [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank14]: work = group.barrier(opts=opts) [rank14]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 14. [rank79]:[E217 15:05:51.629360166 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 79] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank79]:[E217 15:05:51.629382587 ProcessGroupNCCL.cpp:630] [Rank 79] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank79]:[E217 15:05:51.629387687 ProcessGroupNCCL.cpp:636] [Rank 79] To avoid data inconsistency, we are taking the entire process down. [rank44]:[E217 15:05:51.710326853 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 44] Process group watchdog thread terminated with exception: [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b70c436c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b7079a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b7079a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b7079a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b70c505c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b70c9094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b70c9126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank73]:[E217 15:05:51.629950410 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 73] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank73]:[E217 15:05:51.629970962 ProcessGroupNCCL.cpp:630] [Rank 73] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank73]:[E217 15:05:51.629975132 ProcessGroupNCCL.cpp:636] [Rank 73] To avoid data inconsistency, we are taking the entire process down. 
what(): [PG ID 1 PG GUID 1 Rank 44] Process group watchdog thread terminated with exception: [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b70c436c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b7079a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b7079a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b7079a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b70c505c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b70c9094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b70c9126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b70c436c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7b70796a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7b70c505c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7b70c9094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7b70c9126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank34]: Traceback (most recent call last): [rank34]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank34]: train(attn_implementation="flash_attention_2") [rank34]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank34]: trainer.train() [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank34]: return inner_training_loop( [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank34]: self._maybe_log_save_evaluate( [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank34]: self._save_checkpoint(model, trial) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank34]: self._save_optimizer_and_scheduler(output_dir) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank34]: self.model_wrapped.save_checkpoint(output_dir) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank34]: self._create_zero_checkpoint_files(save_dir, tag) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in 
_create_zero_checkpoint_files [rank34]: dist.barrier(group=self.optimizer.dp_process_group) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank34]: return func(*args, **kwargs) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank34]: return cdb.barrier(group=group, async_op=async_op) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank34]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank34]: return func(*args, **kwargs) [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank34]: work = group.barrier(opts=opts) [rank34]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 34. [rank38]:[E217 15:05:51.667241322 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 38] Process group watchdog thread terminated with exception: [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600027 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8980b93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f8935e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8935e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8935e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f8980cee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f8985494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f8985526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 38] Process group watchdog thread terminated with exception: [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600027 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8980b93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f8935e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8935e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8935e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f8980cee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f8985494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f8985526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8980b93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f8935aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f8980cee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f8985494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f8985526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank27]: Traceback (most recent call last): [rank27]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank27]: train(attn_implementation="flash_attention_2") [rank27]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank27]: trainer.train() [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank27]: return inner_training_loop( [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank27]: self._maybe_log_save_evaluate( [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank27]: self._save_checkpoint(model, trial) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank27]: self._save_optimizer_and_scheduler(output_dir) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank27]: self.model_wrapped.save_checkpoint(output_dir) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank27]: self._create_zero_checkpoint_files(save_dir, tag) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank27]: dist.barrier(group=self.optimizer.dp_process_group) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank27]: return func(*args, **kwargs) [rank27]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank27]: return cdb.barrier(group=group, async_op=async_op) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank27]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank27]: return func(*args, **kwargs) [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank27]: work = group.barrier(opts=opts) [rank27]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 27. [rank76]:[E217 15:05:51.636521969 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 76] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank76]:[E217 15:05:51.636543130 ProcessGroupNCCL.cpp:630] [Rank 76] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank76]:[E217 15:05:51.636549840 ProcessGroupNCCL.cpp:636] [Rank 76] To avoid data inconsistency, we are taking the entire process down. [rank86]:[E217 15:05:51.778304765 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 86] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank86]:[E217 15:05:51.778331866 ProcessGroupNCCL.cpp:630] [Rank 86] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank86]:[E217 15:05:51.778337317 ProcessGroupNCCL.cpp:636] [Rank 86] To avoid data inconsistency, we are taking the entire process down. [rank116]:[E217 15:05:51.578326348 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 116] Process group watchdog thread terminated with exception: [Rank 116] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c30900c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c304542a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c3045431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c304543361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c309021c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c3094a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c3094b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 116] Process group watchdog thread terminated with exception: [Rank 116] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c30900c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c304542a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c3045431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank78]:[E217 15:05:51.638718168 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 78] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank78]:[E217 15:05:51.638739369 ProcessGroupNCCL.cpp:630] [Rank 78] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c304543361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c309021c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c3094a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c3094b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c30900c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7c30450a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank78]:[E217 15:05:51.638744780 ProcessGroupNCCL.cpp:636] [Rank 78] To avoid data inconsistency, we are taking the entire process down. 
frame #2: + 0x145c0 (0x7c309021c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7c3094a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7c3094b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank14]:[E217 15:05:51.655732980 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72f75bd2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72f71102a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72f711031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72f71103361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72f75c4735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72f760694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72f760726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72f75bd2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72f71102a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72f711031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72f71103361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72f75c4735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72f760694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72f760726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72f75bd2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72f710ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72f75c4735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72f760694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72f760726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank111]:[E217 15:05:51.817696845 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 111] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank111]:[E217 15:05:51.817729576 ProcessGroupNCCL.cpp:630] [Rank 111] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank111]:[E217 15:05:51.817735626 ProcessGroupNCCL.cpp:636] [Rank 111] To avoid data inconsistency, we are taking the entire process down. [rank34]:[E217 15:05:51.680211644 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 34] Process group watchdog thread terminated with exception: [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72d0f516c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72d0aa82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72d0aa831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72d0aa83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72d0f5e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72d0f9e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72d0f9f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 34] Process group watchdog thread terminated with exception: [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72d0f516c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72d0aa82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72d0aa831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72d0aa83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72d0f5e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72d0f9e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72d0f9f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72d0f516c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72d0aa4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72d0f5e5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72d0f9e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72d0f9f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank109]:[E217 15:05:51.828771591 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 109] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank109]:[E217 15:05:51.828790561 ProcessGroupNCCL.cpp:630] [Rank 109] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
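All of the Python tracebacks in this log stop at the same call, dist.barrier(group=self.optimizer.dp_process_group) inside DeepSpeed's _create_zero_checkpoint_files, and every watchdog record points at the same collective (SeqNum=66413, ALLREDUCE) exceeding the 600000 ms limit shown in Timeout(ms). If the barrier is expiring only because some ranks need more than ten minutes to finish writing the ZeRO checkpoint, rather than because a rank is genuinely hung, one possible mitigation is to create the process group with a longer collective timeout. The snippet below is an illustrative sketch, not something taken from this run; the two-hour value and the explicit init_process_group call are assumptions that would need adapting to the actual launcher.

    from datetime import timedelta
    import torch.distributed as dist

    # Sketch: initialize the default NCCL process group with a longer
    # collective timeout so a slow checkpoint barrier is not aborted by
    # the watchdog after 600000 ms. The 2-hour value is an assumption.
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))

With the Hugging Face Trainer stack visible in the tracebacks, the same effect is usually reached through TrainingArguments(ddp_timeout=7200), and deepspeed.init_distributed also exposes a timeout argument when DeepSpeed creates the group itself. If a rank is actually wedged rather than merely slow, a longer timeout only postpones the abort.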
[rank109]:[E217 15:05:51.828795071 ProcessGroupNCCL.cpp:636] [Rank 109] To avoid data inconsistency, we are taking the entire process down. [rank27]:[E217 15:05:51.511299245 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 27] Process group watchdog thread terminated with exception: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d79292bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d78de62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d78de631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d78de63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d79294175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d792dc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d792dd26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 27] Process group watchdog thread terminated with exception: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d79292bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d78de62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d78de631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d78de63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d79294175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d792dc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d792dd26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d79292bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7d78de2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7d79294175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7d792dc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7d792dd26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank59]:[E217 15:05:51.341367263 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 59] Process group watchdog thread terminated with exception: [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x769a010e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7699b642a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7699b6431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7699b643361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x769a018735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x769a05a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x769a05b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 59] Process group watchdog thread terminated with exception: [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600049 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x769a010e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7699b642a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7699b6431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7699b643361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x769a018735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x769a05a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x769a05b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x769a010e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7699b60a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x769a018735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x769a05a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x769a05b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank76]:[E217 15:05:51.676154809 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 76] Process group watchdog thread terminated with exception: [Rank 76] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600053 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ebbe3193446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ebb9842a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ebb98431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ebb9843361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ebbe32ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ebbe7a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ebbe7b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 76] Process group watchdog thread terminated with exception: [Rank 76] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600053 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ebbe3193446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ebb9842a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ebb98431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ebb9843361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ebbe32ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ebbe7a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ebbe7b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ebbe3193446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7ebb980a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7ebbe32ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7ebbe7a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7ebbe7b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank73]: Traceback (most recent call last): [rank73]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank73]: train(attn_implementation="flash_attention_2") [rank73]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank73]: trainer.train() [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank73]: return inner_training_loop( [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank73]: self._maybe_log_save_evaluate( [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank73]: self._save_checkpoint(model, trial) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank73]: self._save_optimizer_and_scheduler(output_dir) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank73]: self.model_wrapped.save_checkpoint(output_dir) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank73]: self._create_zero_checkpoint_files(save_dir, tag) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank73]: dist.barrier(group=self.optimizer.dp_process_group) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank73]: return func(*args, **kwargs) [rank73]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank73]: return cdb.barrier(group=group, async_op=async_op) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank73]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank73]: return func(*args, **kwargs) [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank73]: work = group.barrier(opts=opts) [rank73]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 73. [rank33]:[E217 15:05:51.717907617 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 33] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank33]:[E217 15:05:51.717934208 ProcessGroupNCCL.cpp:630] [Rank 33] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank33]:[E217 15:05:51.717939768 ProcessGroupNCCL.cpp:636] [Rank 33] To avoid data inconsistency, we are taking the entire process down. [rank79]: Traceback (most recent call last): [rank79]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank79]: train(attn_implementation="flash_attention_2") [rank79]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank79]: trainer.train() [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank79]: return inner_training_loop( [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank79]: self._maybe_log_save_evaluate( [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank79]: self._save_checkpoint(model, trial) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank79]: self._save_optimizer_and_scheduler(output_dir) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank79]: self.model_wrapped.save_checkpoint(output_dir) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank79]: self._create_zero_checkpoint_files(save_dir, tag) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank79]: dist.barrier(group=self.optimizer.dp_process_group) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank79]: return func(*args, **kwargs) [rank35]:[E217 15:05:51.718880928 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 35] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank35]:[E217 15:05:51.718900769 ProcessGroupNCCL.cpp:630] [Rank 35] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank35]:[E217 15:05:51.718905829 ProcessGroupNCCL.cpp:636] [Rank 35] To avoid data inconsistency, we are taking the entire process down. [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank79]: return cdb.barrier(group=group, async_op=async_op) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank79]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank79]: return func(*args, **kwargs) [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank79]: work = group.barrier(opts=opts) [rank79]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 79. [rank73]:[E217 15:05:51.694557376 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 73] Process group watchdog thread terminated with exception: [Rank 73] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76765096a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x767605c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x767605c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x767605c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7676510585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x767655294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x767655326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 73] Process group watchdog thread terminated with exception: [Rank 73] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76765096a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x767605c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x767605c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x767605c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7676510585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x767655294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x767655326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76765096a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7676058a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7676510585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x767655294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x767655326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank106]:[E217 15:05:51.869899372 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 106] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank106]:[E217 15:05:51.869925943 ProcessGroupNCCL.cpp:630] [Rank 106] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank106]:[E217 15:05:51.869933323 ProcessGroupNCCL.cpp:636] [Rank 106] To avoid data inconsistency, we are taking the entire process down. 
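The watchdog messages above all report the same failure: an ALLREDUCE (SeqNum=66413) exceeded the 600000 ms collective timeout, and the process group was torn down on every rank. Below is a minimal sketch of the two places such a timeout is usually raised when a slow checkpoint barrier is expected. It assumes a torchrun + Hugging Face Trainer setup like the tracebacks in this log suggest; the values are illustrative, and whether ddp_timeout reaches a DeepSpeed-created process group depends on how that group is initialized.

import datetime
import torch.distributed as dist
from transformers import TrainingArguments

def init_with_long_timeout():
    # Run under torchrun so RANK / WORLD_SIZE / MASTER_ADDR are already set.
    # The timeout bounds how long any single collective may run before the
    # NCCL watchdog aborts it (this log shows a 600 s limit being hit).
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))

# When launching through the Hugging Face Trainer (the call stack seen in the
# tracebacks in this log), the same knob is exposed in seconds:
training_args = TrainingArguments(output_dir="out", ddp_timeout=7200)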
[rank78]: Traceback (most recent call last): [rank78]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank78]: train(attn_implementation="flash_attention_2") [rank78]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank78]: trainer.train() [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank78]: return inner_training_loop( [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank78]: self._maybe_log_save_evaluate( [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank78]: self._save_checkpoint(model, trial) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank78]: self._save_optimizer_and_scheduler(output_dir) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank78]: self.model_wrapped.save_checkpoint(output_dir) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank78]: self._create_zero_checkpoint_files(save_dir, tag) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank78]: dist.barrier(group=self.optimizer.dp_process_group) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank78]: return func(*args, **kwargs) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank78]: return cdb.barrier(group=group, async_op=async_op) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank78]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank78]: return func(*args, **kwargs) [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank78]: work = group.barrier(opts=opts) [rank78]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 78. 
[rank86]: Traceback (most recent call last): (identical to the rank 78 traceback above, through the same Trainer -> DeepSpeed save_checkpoint -> dist.barrier call path) [rank86]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 86.
[rank111]: Traceback (most recent call last): (identical to the rank 78 traceback above, through the same Trainer -> DeepSpeed save_checkpoint -> dist.barrier call path) [rank111]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 111.
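Every Python traceback in this log ends at the same call path: transformers' Trainer saves a checkpoint, DeepSpeed's engine runs _create_zero_checkpoint_files, and that issues dist.barrier(group=self.optimizer.dp_process_group). A barrier is a collective, so a single rank that is still writing (or stuck writing) its ZeRO shard keeps every other rank blocked until the watchdog timeout fires. The sketch below only illustrates that pattern; it is not the DeepSpeed source, and the function name and arguments are made up.

import torch
import torch.distributed as dist

def save_shard_then_sync(state_dict, shard_path, group=None):
    # Each rank writes only its own shard; on a slow or overloaded shared
    # filesystem this step can take arbitrarily long on one rank.
    torch.save(state_dict, shard_path)
    # Collective: returns only after every rank in `group` has reached it,
    # so one straggler holds all other ranks here until the NCCL watchdog
    # aborts the communicator, which is exactly what this log shows.
    dist.barrier(group=group)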
[rank109]: Traceback (most recent call last): [rank109]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank109]: train(attn_implementation="flash_attention_2") [rank109]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank109]: trainer.train() [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank109]: return inner_training_loop( [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank109]: self._maybe_log_save_evaluate( [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank109]: self._save_checkpoint(model, trial) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank109]: self._save_optimizer_and_scheduler(output_dir) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank109]: self.model_wrapped.save_checkpoint(output_dir) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank109]: self._create_zero_checkpoint_files(save_dir, tag) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank109]: dist.barrier(group=self.optimizer.dp_process_group) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank109]: return func(*args, **kwargs) [rank79]:[E217 15:05:51.708376481 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 79] Process group watchdog thread terminated with exception: [Rank 79] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x722418f93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7223ce22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7223ce231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank109]: return cdb.barrier(group=group, async_op=async_op) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank109]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank109]: return func(*args, **kwargs) [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank109]: work = group.barrier(opts=opts) [rank109]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 109. frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7223ce23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7224190ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72241d894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72241d926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 79] Process group watchdog thread terminated with exception: [Rank 79] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x722418f93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7223ce22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7223ce231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7223ce23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7224190ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72241d894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72241d926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x722418f93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7223cdea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7224190ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72241d894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72241d926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank111]:[E217 15:05:51.883704901 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 111] Process group watchdog thread terminated with exception: [Rank 111] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74aa80d76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74aa3602a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74aa36031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74aa3603361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x74aa818555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x74aa85694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x74aa85726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 111] Process group watchdog thread terminated with exception: [Rank 111] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74aa80d76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74aa3602a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74aa36031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74aa3603361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x74aa818555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x74aa85694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x74aa85726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74aa80d76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x74aa35ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x74aa818555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x74aa85694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x74aa85726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank86]:[E217 15:05:51.855812304 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 86] Process group watchdog thread terminated with exception: [Rank 86] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d1da82db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d1d5d62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d1d5d631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d1d5d63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d1da8e525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d1dacc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d1dacd26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 86] Process group watchdog thread terminated with exception: [Rank 86] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d1da82db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d1d5d62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d1d5d631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d1d5d63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d1da8e525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d1dacc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d1dacd26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d1da82db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7d1d5d2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7d1da8e525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7d1dacc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7d1dacd26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank26]:[E217 15:05:51.576825283 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 26] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank26]:[E217 15:05:51.576849444 ProcessGroupNCCL.cpp:630] [Rank 26] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank26]:[E217 15:05:51.576856165 ProcessGroupNCCL.cpp:636] [Rank 26] To avoid data inconsistency, we are taking the entire process down. [rank30]:[E217 15:05:51.577057602 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 30] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank30]:[E217 15:05:51.577087963 ProcessGroupNCCL.cpp:630] [Rank 30] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank30]:[E217 15:05:51.577094604 ProcessGroupNCCL.cpp:636] [Rank 30] To avoid data inconsistency, we are taking the entire process down. [rank28]:[E217 15:05:51.578532650 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 28] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank28]:[E217 15:05:51.578545871 ProcessGroupNCCL.cpp:630] [Rank 28] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank28]:[E217 15:05:51.578549641 ProcessGroupNCCL.cpp:636] [Rank 28] To avoid data inconsistency, we are taking the entire process down. 
[rank109]:[E217 15:05:51.898153343 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 109] Process group watchdog thread terminated with exception: [Rank 109] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e9792e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72e92e62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72e92e631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72e92e63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72e979a6d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72e97dc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72e97dd26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 109] Process group watchdog thread terminated with exception: [Rank 109] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e9792e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72e92e62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72e92e631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72e92e63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72e979a6d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72e97dc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72e97dd26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e9792e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72e92e2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72e979a6d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72e97dc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72e97dd26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank33]: Traceback (most recent call last): [rank33]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank33]: train(attn_implementation="flash_attention_2") [rank33]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank33]: trainer.train() [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank33]: return inner_training_loop( [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank33]: self._maybe_log_save_evaluate( [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank33]: self._save_checkpoint(model, trial) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank33]: self._save_optimizer_and_scheduler(output_dir) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank33]: self.model_wrapped.save_checkpoint(output_dir) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank33]: self._create_zero_checkpoint_files(save_dir, tag) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank33]: dist.barrier(group=self.optimizer.dp_process_group) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank33]: return func(*args, **kwargs) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank33]: return cdb.barrier(group=group, async_op=async_op) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank33]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank33]: return func(*args, **kwargs) [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank33]: work = group.barrier(opts=opts) [rank33]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 33. [rank118]:[E217 15:05:51.671968456 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 118] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank118]:[E217 15:05:51.671990017 ProcessGroupNCCL.cpp:630] [Rank 118] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank118]:[E217 15:05:51.671995077 ProcessGroupNCCL.cpp:636] [Rank 118] To avoid data inconsistency, we are taking the entire process down. [rank114]:[E217 15:05:51.672088432 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 114] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank114]:[E217 15:05:51.672107353 ProcessGroupNCCL.cpp:630] [Rank 114] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank114]:[E217 15:05:51.672112103 ProcessGroupNCCL.cpp:636] [Rank 114] To avoid data inconsistency, we are taking the entire process down. [rank33]:[E217 15:05:51.767034830 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 33] Process group watchdog thread terminated with exception: [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fce09d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7fcdbf42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fcdbf431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fcdbf43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7fce0a1635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7fce0e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fce0e926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 33] Process group watchdog thread terminated with exception: [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fce09d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7fcdbf42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fcdbf431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fcdbf43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7fce0a1635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7fce0e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fce0e926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fce09d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7fcdbf0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7fce0a1635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7fce0e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7fce0e926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank98]:[E217 15:05:51.395227108 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 98] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank98]:[E217 15:05:51.395257909 ProcessGroupNCCL.cpp:630] [Rank 98] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank98]:[E217 15:05:51.395265849 ProcessGroupNCCL.cpp:636] [Rank 98] To avoid data inconsistency, we are taking the entire process down. [rank100]:[E217 15:05:51.397555973 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 100] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank100]:[E217 15:05:51.397574793 ProcessGroupNCCL.cpp:630] [Rank 100] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank100]:[E217 15:05:51.397580273 ProcessGroupNCCL.cpp:636] [Rank 100] To avoid data inconsistency, we are taking the entire process down. [rank22]:[E217 15:05:51.004420757 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 22] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank22]:[E217 15:05:51.004441309 ProcessGroupNCCL.cpp:630] [Rank 22] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank22]:[E217 15:05:51.004445959 ProcessGroupNCCL.cpp:636] [Rank 22] To avoid data inconsistency, we are taking the entire process down. 
[rank35]: Traceback (most recent call last): (identical to the rank 78 traceback above, through the same Trainer -> DeepSpeed save_checkpoint -> dist.barrier call path) [rank35]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 35.
[rank106]: Traceback (most recent call last): [rank106]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank106]: train(attn_implementation="flash_attention_2") [rank106]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank106]: trainer.train() [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank106]: return inner_training_loop( [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank106]: self._maybe_log_save_evaluate( [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank106]: self._save_checkpoint(model, trial) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank106]: self._save_optimizer_and_scheduler(output_dir) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank106]: self.model_wrapped.save_checkpoint(output_dir) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank106]: self._create_zero_checkpoint_files(save_dir, tag) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank106]: dist.barrier(group=self.optimizer.dp_process_group) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank106]: return func(*args, **kwargs) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank106]: return cdb.barrier(group=group, async_op=async_op) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank106]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank106]: return func(*args, **kwargs) [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank106]: work = group.barrier(opts=opts) [rank106]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 106. [rank107]:[E217 15:05:51.921060869 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 107] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank107]:[E217 15:05:51.921084640 ProcessGroupNCCL.cpp:630] [Rank 107] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank107]:[E217 15:05:51.921090500 ProcessGroupNCCL.cpp:636] [Rank 107] To avoid data inconsistency, we are taking the entire process down. [rank57]:[E217 15:05:51.426302725 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 57] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank57]:[E217 15:05:51.426332337 ProcessGroupNCCL.cpp:630] [Rank 57] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank57]:[E217 15:05:51.426337897 ProcessGroupNCCL.cpp:636] [Rank 57] To avoid data inconsistency, we are taking the entire process down. [rank108]:[E217 15:05:51.927524675 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 108] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank108]:[E217 15:05:51.927544666 ProcessGroupNCCL.cpp:630] [Rank 108] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank108]:[E217 15:05:51.927549716 ProcessGroupNCCL.cpp:636] [Rank 108] To avoid data inconsistency, we are taking the entire process down. [rank110]:[E217 15:05:51.929655007 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 110] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank110]:[E217 15:05:51.929679997 ProcessGroupNCCL.cpp:630] [Rank 110] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank110]:[E217 15:05:51.929685137 ProcessGroupNCCL.cpp:636] [Rank 110] To avoid data inconsistency, we are taking the entire process down. [rank66]:[E217 15:05:51.563264139 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 66] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank66]:[E217 15:05:51.563292910 ProcessGroupNCCL.cpp:630] [Rank 66] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank66]:[E217 15:05:51.563300410 ProcessGroupNCCL.cpp:636] [Rank 66] To avoid data inconsistency, we are taking the entire process down. [rank61]:[E217 15:05:51.443038563 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 61] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank61]:[E217 15:05:51.443059194 ProcessGroupNCCL.cpp:630] [Rank 61] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank61]:[E217 15:05:51.443064304 ProcessGroupNCCL.cpp:636] [Rank 61] To avoid data inconsistency, we are taking the entire process down. 
[rank30]: Traceback (most recent call last): [rank30]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank30]: train(attn_implementation="flash_attention_2") [rank30]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank30]: trainer.train() [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank30]: return inner_training_loop( [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank30]: self._maybe_log_save_evaluate( [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank30]: self._save_checkpoint(model, trial) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank30]: self._save_optimizer_and_scheduler(output_dir) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank30]: self.model_wrapped.save_checkpoint(output_dir) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank30]: self._create_zero_checkpoint_files(save_dir, tag) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank30]: dist.barrier(group=self.optimizer.dp_process_group) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank30]: return func(*args, **kwargs) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank30]: return cdb.barrier(group=group, async_op=async_op) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank30]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank30]: return func(*args, **kwargs) [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank30]: work = group.barrier(opts=opts) [rank30]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 30. [rank26]:[E217 15:05:51.623675553 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. 
[rank26]:[E217 15:05:51.623675553 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7096a29b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x709657c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x709657c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x709657c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7096a346c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7096a7294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7096a7326850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank26]: Traceback (most recent call last):
[rank26]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank26]:     train(attn_implementation="flash_attention_2")
[rank26]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank26]:     trainer.train()
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank26]:     return inner_training_loop(
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank26]:     self._maybe_log_save_evaluate(
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank26]:     self._save_checkpoint(model, trial)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank26]:     self._save_optimizer_and_scheduler(output_dir)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank26]:     self.model_wrapped.save_checkpoint(output_dir)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank26]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank26]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank26]:     return func(*args, **kwargs)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank26]:     return cdb.barrier(group=group, async_op=async_op)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank26]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank26]:     return func(*args, **kwargs)
[rank26]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank26]:     work = group.barrier(opts=opts)
[rank26]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 26.
what(): [PG ID 1 PG GUID 1 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7096a29b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x709657c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x709657c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x709657c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7096a346c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7096a7294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7096a7326850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7096a29b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7096578a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7096a346c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7096a7294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7096a7326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank117]:[E217 15:05:51.711539236 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 117] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank117]:[E217 15:05:51.711571388 ProcessGroupNCCL.cpp:630] [Rank 117] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank117]:[E217 15:05:51.711577498 ProcessGroupNCCL.cpp:636] [Rank 117] To avoid data inconsistency, we are taking the entire process down.
[... the same record types repeat, in various interleavings, for ranks 22, 28, 30, 37, 39, 57, 63, 68, 78, 98, 100, 102, 106, 107, 108, 110, 114, and 118: the per-rank timeout/abort notices for NCCL work 66413, the identical Python traceback ending in dist.barrier inside deepspeed _create_zero_checkpoint_files, and the identical watchdog c10::DistBackendError stacks, each reporting WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) running for roughly 600000-600100 milliseconds before timing out ...]
[rank107]:[E217 15:05:51.987940792 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 107] Process group watchdog thread terminated with exception: [Rank 107] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f58816c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78f53d82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78f53d831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78f53d83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78f5892565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78f58ce94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78f58cf26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 107] Process group watchdog thread terminated with exception: [Rank 107] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f58816c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78f53d82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78f53d831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78f53d83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78f5892565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78f58ce94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78f58cf26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f58816c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x78f53d4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x78f5892565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x78f58ce94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x78f58cf26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank91]:[E217 15:05:51.042513107 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 91] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank91]:[E217 15:05:51.042539089 ProcessGroupNCCL.cpp:630] [Rank 91] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank91]:[E217 15:05:51.042544039 ProcessGroupNCCL.cpp:636] [Rank 91] To avoid data inconsistency, we are taking the entire process down.
[rank61]: Traceback (most recent call last):
[rank61]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in
[rank61]: train(attn_implementation="flash_attention_2")
[rank61]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank61]: trainer.train()
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank61]: return inner_training_loop(
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank61]: self._maybe_log_save_evaluate(
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank61]: self._save_checkpoint(model, trial)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank61]: self._save_optimizer_and_scheduler(output_dir)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank61]: self.model_wrapped.save_checkpoint(output_dir)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank61]: self._create_zero_checkpoint_files(save_dir, tag)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank61]: dist.barrier(group=self.optimizer.dp_process_group)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank61]: return func(*args, **kwargs)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank61]: return cdb.barrier(group=group, async_op=async_op)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank61]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank61]: return func(*args, **kwargs)
[rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank61]: work = group.barrier(opts=opts)
[rank61]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 61.
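Note (editorial, not part of the captured log): every rank that reports here shows the same pattern — the ALLREDUCE at SeqNum=66413 trips the NCCL watchdog after the 600000 ms (10-minute) limit while dist.barrier() in DeepSpeed's _create_zero_checkpoint_files waits on the checkpoint write. As a hedged, generic sketch only: that limit is the process-group timeout, which can be raised where the process group is created. In this run the group is initialized by the launcher/Trainer rather than by hand, so the snippet below assumes a standalone torch.distributed setup; with the Hugging Face Trainer the equivalent knob is the ddp_timeout training argument (in recent versions), and deepspeed.init_distributed() also accepts a timeout.

# Hypothetical standalone setup; assumes torchrun-style env variables
# (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are already exported.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    # NCCL process groups default to 10 minutes in recent PyTorch,
    # which matches Timeout(ms)=600000 in the watchdog messages above.
    timeout=timedelta(minutes=60),
)

Raising the timeout only buys headroom for slow checkpoint I/O; the stall at deepspeed/runtime/engine.py:3376 that keeps some ranks in the barrier is still worth investigating separately.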
[rank66]: Traceback (most recent call last): [rank66]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank66]: train(attn_implementation="flash_attention_2") [rank66]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank66]: trainer.train() [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank66]: return inner_training_loop( [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank66]: self._maybe_log_save_evaluate( [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank66]: self._save_checkpoint(model, trial) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank66]: self._save_optimizer_and_scheduler(output_dir) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank66]: self.model_wrapped.save_checkpoint(output_dir) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank66]: self._create_zero_checkpoint_files(save_dir, tag) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank66]: dist.barrier(group=self.optimizer.dp_process_group) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank66]: return func(*args, **kwargs) [rank119]:[E217 15:05:51.756584689 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 119] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank119]:[E217 15:05:51.756602440 ProcessGroupNCCL.cpp:630] [Rank 119] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank66]: return cdb.barrier(group=group, async_op=async_op) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank66]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank66]: return func(*args, **kwargs) [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank66]: work = group.barrier(opts=opts) [rank66]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 66. [rank119]:[E217 15:05:51.756607160 ProcessGroupNCCL.cpp:636] [Rank 119] To avoid data inconsistency, we are taking the entire process down. [rank108]:[E217 15:05:51.992478171 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 108] Process group watchdog thread terminated with exception: [Rank 108] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cc293750446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cc248a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cc248a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cc248a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cc2938ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cc298094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cc298126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank113]:[E217 15:05:51.758325008 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 113] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank113]:[E217 15:05:51.758344610 ProcessGroupNCCL.cpp:630] [Rank 113] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank113]:[E217 15:05:51.758349330 ProcessGroupNCCL.cpp:636] [Rank 113] To avoid data inconsistency, we are taking the entire process down. what(): [PG ID 1 PG GUID 1 Rank 108] Process group watchdog thread terminated with exception: [Rank 108] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cc293750446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cc248a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cc248a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cc248a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cc2938ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cc298094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cc298126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cc293750446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7cc2486a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7cc2938ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7cc298094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7cc298126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank82]:[E217 15:05:51.960329046 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 82] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank82]:[E217 15:05:51.960353467 ProcessGroupNCCL.cpp:630] [Rank 82] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank82]:[E217 15:05:51.960358148 ProcessGroupNCCL.cpp:636] [Rank 82] To avoid data inconsistency, we are taking the entire process down. [rank89]:[E217 15:05:51.046882047 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 89] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank89]:[E217 15:05:51.046909009 ProcessGroupNCCL.cpp:630] [Rank 89] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank89]:[E217 15:05:51.046915949 ProcessGroupNCCL.cpp:636] [Rank 89] To avoid data inconsistency, we are taking the entire process down. [rank115]:[E217 15:05:51.761056729 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 115] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank115]:[E217 15:05:51.761090321 ProcessGroupNCCL.cpp:630] [Rank 115] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank115]:[E217 15:05:51.761094741 ProcessGroupNCCL.cpp:636] [Rank 115] To avoid data inconsistency, we are taking the entire process down. 
[rank110]:[E217 15:05:51.998867925 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 110] Process group watchdog thread terminated with exception: [Rank 110] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75ccbf56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75cc74c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75cc74c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75cc74c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75ccc025c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75ccc4294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75ccc4326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 110] Process group watchdog thread terminated with exception: [Rank 110] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75ccbf56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75cc74c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75cc74c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75cc74c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75ccc025c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75ccc4294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75ccc4326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75ccbf56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75cc748a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x75ccc025c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75ccc4294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75ccc4326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank117]: Traceback (most recent call last): [rank117]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank117]: train(attn_implementation="flash_attention_2") [rank117]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank117]: trainer.train() [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank117]: return inner_training_loop( [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank117]: self._maybe_log_save_evaluate( [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank117]: self._save_checkpoint(model, trial) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank117]: self._save_optimizer_and_scheduler(output_dir) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank117]: self.model_wrapped.save_checkpoint(output_dir) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank117]: self._create_zero_checkpoint_files(save_dir, tag) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank117]: dist.barrier(group=self.optimizer.dp_process_group) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank117]: return func(*args, **kwargs) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank117]: return cdb.barrier(group=group, async_op=async_op) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank117]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank117]: return func(*args, **kwargs) [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank117]: work = group.barrier(opts=opts) [rank117]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 117. 
[rank102]: Traceback (most recent call last): [rank102]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank102]: train(attn_implementation="flash_attention_2") [rank102]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank102]: trainer.train() [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank102]: return inner_training_loop( [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank102]: self._maybe_log_save_evaluate( [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank102]: self._save_checkpoint(model, trial) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank102]: self._save_optimizer_and_scheduler(output_dir) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank102]: self.model_wrapped.save_checkpoint(output_dir) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank102]: self._create_zero_checkpoint_files(save_dir, tag) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank102]: dist.barrier(group=self.optimizer.dp_process_group) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank102]: return func(*args, **kwargs) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank102]: return cdb.barrier(group=group, async_op=async_op) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank102]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank102]: return func(*args, **kwargs) [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank102]: work = group.barrier(opts=opts) [rank102]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 102. [rank35]:[E217 15:05:51.867103823 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 35] Process group watchdog thread terminated with exception: [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e7807776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e77bca2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e77bca31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e77bca3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e78082555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e780c094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e780c126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 35] Process group watchdog thread terminated with exception: [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e7807776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e77bca2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e77bca31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e77bca3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e78082555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e780c094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e780c126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e7807776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e77bc6a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7e78082555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e780c094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e780c126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank37]:[E217 15:05:51.870246888 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 37] Process group watchdog thread terminated with exception: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600027 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f88c58db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f887ac2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f887ac31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f887ac3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f88c64585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f88ca294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f88ca326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 37] Process group watchdog thread terminated with exception: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600027 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f88c58db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f887ac2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f887ac31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f887ac3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f88c64585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f88ca294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f88ca326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f88c58db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f887a8a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f88c64585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f88ca294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f88ca326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank39]: Traceback (most recent call last): [rank39]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank39]: train(attn_implementation="flash_attention_2") [rank39]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank39]: trainer.train() [rank39]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank39]: return inner_training_loop( [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank39]: self._maybe_log_save_evaluate( [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank39]: self._save_checkpoint(model, trial) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank39]: self._save_optimizer_and_scheduler(output_dir) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank39]: self.model_wrapped.save_checkpoint(output_dir) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank39]: self._create_zero_checkpoint_files(save_dir, tag) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank39]: dist.barrier(group=self.optimizer.dp_process_group) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank39]: return func(*args, **kwargs) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank39]: return cdb.barrier(group=group, async_op=async_op) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank39]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank39]: return func(*args, **kwargs) [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank39]: work = group.barrier(opts=opts) [rank39]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 39. [rank70]:[E217 15:05:51.644721394 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 70] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank70]:[E217 15:05:51.644746824 ProcessGroupNCCL.cpp:630] [Rank 70] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank70]:[E217 15:05:51.644752074 ProcessGroupNCCL.cpp:636] [Rank 70] To avoid data inconsistency, we are taking the entire process down. [rank61]:[E217 15:05:51.523113963 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 61] Process group watchdog thread terminated with exception: [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fba526bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7fba07a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fba07a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fba07a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7fba528175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7fba57094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fba57126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 61] Process group watchdog thread terminated with exception: [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600045 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fba526bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7fba07a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fba07a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fba07a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7fba528175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7fba57094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fba57126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fba526bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7fba076a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7fba528175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7fba57094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7fba57126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank84]:[E217 15:05:51.992959808 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 84] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank84]:[E217 15:05:51.992980609 ProcessGroupNCCL.cpp:630] [Rank 84] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank84]:[E217 15:05:51.992987149 ProcessGroupNCCL.cpp:636] [Rank 84] To avoid data inconsistency, we are taking the entire process down. [rank39]:[E217 15:05:51.889783655 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 39] Process group watchdog thread terminated with exception: [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4fd6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f3b0542a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b05431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b0543361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f3b50a5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f3b54894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f3b54926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 39] Process group watchdog thread terminated with exception: [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600057 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4fd6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f3b0542a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b05431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f3b0543361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f3b50a5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f3b54894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f3b54926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b4fd6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f3b050a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f3b50a5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f3b54894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f3b54926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank18]:[E217 15:05:51.127083181 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 18] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank18]:[E217 15:05:51.127110792 ProcessGroupNCCL.cpp:630] [Rank 18] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank18]:[E217 15:05:51.127116132 ProcessGroupNCCL.cpp:636] [Rank 18] To avoid data inconsistency, we are taking the entire process down. 
[rank113]: Traceback (most recent call last): [rank113]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank113]: train(attn_implementation="flash_attention_2") [rank113]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank113]: trainer.train() [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank113]: return inner_training_loop( [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank113]: self._maybe_log_save_evaluate( [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank113]: self._save_checkpoint(model, trial) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank113]: self._save_optimizer_and_scheduler(output_dir) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank113]: self.model_wrapped.save_checkpoint(output_dir) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank113]: self._create_zero_checkpoint_files(save_dir, tag) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank113]: dist.barrier(group=self.optimizer.dp_process_group) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank113]: return func(*args, **kwargs) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank113]: return cdb.barrier(group=group, async_op=async_op) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank113]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank113]: return func(*args, **kwargs) [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank113]: work = group.barrier(opts=opts) [rank113]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 113. 
[rank63]: Traceback (most recent call last): [rank63]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank63]: train(attn_implementation="flash_attention_2") [rank63]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank63]: trainer.train() [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank63]: return inner_training_loop( [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank63]: self._maybe_log_save_evaluate( [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank63]: self._save_checkpoint(model, trial) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank63]: self._save_optimizer_and_scheduler(output_dir) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank63]: self.model_wrapped.save_checkpoint(output_dir) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank63]: self._create_zero_checkpoint_files(save_dir, tag) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank63]: dist.barrier(group=self.optimizer.dp_process_group) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank63]: return func(*args, **kwargs) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank63]: return cdb.barrier(group=group, async_op=async_op) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank63]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank63]: return func(*args, **kwargs) [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank63]: work = group.barrier(opts=opts) [rank63]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 63. 
[rank89]: Traceback (most recent call last): [rank89]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank89]: train(attn_implementation="flash_attention_2") [rank89]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank89]: trainer.train() [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank89]: return inner_training_loop( [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank89]: self._maybe_log_save_evaluate( [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank89]: self._save_checkpoint(model, trial) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank89]: self._save_optimizer_and_scheduler(output_dir) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank89]: self.model_wrapped.save_checkpoint(output_dir) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank89]: self._create_zero_checkpoint_files(save_dir, tag) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank89]: dist.barrier(group=self.optimizer.dp_process_group) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank89]: return func(*args, **kwargs) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank89]: return cdb.barrier(group=group, async_op=async_op) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank89]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank89]: return func(*args, **kwargs) [rank89]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank89]: work = group.barrier(opts=opts) [rank89]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 89. 
[rank91]: Traceback (most recent call last): [rank91]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank91]: train(attn_implementation="flash_attention_2") [rank91]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank91]: trainer.train() [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank91]: return inner_training_loop( [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank91]: self._maybe_log_save_evaluate( [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank91]: self._save_checkpoint(model, trial) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank91]: self._save_optimizer_and_scheduler(output_dir) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank91]: self.model_wrapped.save_checkpoint(output_dir) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank91]: self._create_zero_checkpoint_files(save_dir, tag) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank91]: dist.barrier(group=self.optimizer.dp_process_group) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank91]: return func(*args, **kwargs) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank91]: return cdb.barrier(group=group, async_op=async_op) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank91]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank91]: return func(*args, **kwargs) [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank91]: work = group.barrier(opts=opts) [rank91]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 91. 
[rank115]: Traceback (most recent call last):
[rank115]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank115]:     train(attn_implementation="flash_attention_2")
[rank115]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank115]:     trainer.train()
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank115]:     return inner_training_loop(
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank115]:     self._maybe_log_save_evaluate(
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank115]:     self._save_checkpoint(model, trial)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank115]:     self._save_optimizer_and_scheduler(output_dir)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank115]:     self.model_wrapped.save_checkpoint(output_dir)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank115]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank115]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank115]:     return func(*args, **kwargs)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank115]:     return cdb.barrier(group=group, async_op=async_op)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank115]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank115]:     return func(*args, **kwargs)
[rank115]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank115]:     work = group.barrier(opts=opts)
[rank115]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 115.
[rank91]:[E217 15:05:51.103151394 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 91] Process group watchdog thread terminated with exception: [Rank 91] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75002276c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74ffd7e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74ffd7e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank87]:[E217 15:05:51.016877066 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 87] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank87]:[E217 15:05:51.016902517 ProcessGroupNCCL.cpp:630] [Rank 87] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank87]:[E217 15:05:51.016907507 ProcessGroupNCCL.cpp:636] [Rank 87] To avoid data inconsistency, we are taking the entire process down.
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74ffd7e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x75002345c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x750027294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x750027326850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank119]: Traceback (most recent call last):
[rank119]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank119]:     train(attn_implementation="flash_attention_2")
[rank119]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank119]:     trainer.train()
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank119]:     return inner_training_loop(
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank119]:     self._maybe_log_save_evaluate(
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank119]:     self._save_checkpoint(model, trial)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank119]:     self._save_optimizer_and_scheduler(output_dir)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank119]:     self.model_wrapped.save_checkpoint(output_dir)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank119]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank119]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank119]:     return func(*args, **kwargs)
what(): [PG ID 1 PG GUID 1 Rank 91] Process group watchdog thread terminated with exception: [Rank 91] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75002276c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74ffd7e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74ffd7e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank119]:     return cdb.barrier(group=group, async_op=async_op)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank119]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank119]:     return func(*args, **kwargs)
[rank119]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank119]:     work = group.barrier(opts=opts)
[rank119]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 119.
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74ffd7e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x75002345c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x750027294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x750027326850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75002276c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x74ffd7aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x75002345c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x750027294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x750027326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank100]:[E217 15:05:51.535606555 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 100] Process group watchdog thread terminated with exception: [Rank 100] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b9334f76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b92ea22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b92ea231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b92ea23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7b93356585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7b9339894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7b9339926850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 100] Process group watchdog thread terminated with exception: [Rank 100] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b9334f76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b92ea22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b92ea231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b92ea23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7b93356585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7b9339894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7b9339926850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b9334f76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7b92e9ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7b93356585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7b9339894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7b9339926850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank89]:[E217 15:05:51.106382131 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 89] Process group watchdog thread terminated with exception: [Rank 89] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71818bac1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x718140e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x718140e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x718140e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x71818bc1c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x718190494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x718190526850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 89] Process group watchdog thread terminated with exception: [Rank 89] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71818bac1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x718140e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x718140e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x718140e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x71818bc1c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x718190494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x718190526850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71818bac1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x718140aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x71818bc1c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x718190494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x718190526850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank82]: Traceback (most recent call last):
[rank82]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank82]:     train(attn_implementation="flash_attention_2")
[rank82]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank82]:     trainer.train()
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank82]:     return inner_training_loop(
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank82]:     self._maybe_log_save_evaluate(
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank82]:     self._save_checkpoint(model, trial)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank82]:     self._save_optimizer_and_scheduler(output_dir)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank82]:     self.model_wrapped.save_checkpoint(output_dir)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank82]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank82]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank82]:     return func(*args, **kwargs)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank82]:     return cdb.barrier(group=group, async_op=async_op)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank82]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank82]:     return func(*args, **kwargs)
[rank82]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank82]:     work = group.barrier(opts=opts)
[rank82]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 82.
[rank102]:[E217 15:05:51.539777715 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 102] Process group watchdog thread terminated with exception: [Rank 102] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a32f13b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a32a662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a32a6631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a32a663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7a32f1e675c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7a32f5c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7a32f5d26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 102] Process group watchdog thread terminated with exception: [Rank 102] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a32f13b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a32a662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a32a6631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a32a663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7a32f1e675c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7a32f5c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7a32f5d26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a32f13b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7a32a62a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7a32f1e675c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7a32f5c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7a32f5d26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank82]:[E217 15:05:51.029524889 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 82] Process group watchdog thread terminated with exception: [Rank 82] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73e30d6c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73e2c2a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73e2c2a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73e2c2a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x73e30d81c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x73e312094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x73e312126850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 82] Process group watchdog thread terminated with exception: [Rank 82] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73e30d6c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73e2c2a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73e2c2a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73e2c2a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x73e30d81c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x73e312094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x73e312126850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73e30d6c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x73e2c26a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x73e30d81c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x73e312094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x73e312126850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank84]:[E217 15:05:51.040188320 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 84] Process group watchdog thread terminated with exception: [Rank 84] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7da417d93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7da3cd02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7da3cd031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7da3cd03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7da417eee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7da41c694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7da41c726850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank66]:[E217 15:05:51.702188591 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 66] Process group watchdog thread terminated with exception: [Rank 66] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a02781b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a022d42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a022d431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a022d43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7a0278c665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7a027ca94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7a027cb26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank70]: Traceback (most recent call last):
[rank70]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank70]:     train(attn_implementation="flash_attention_2")
[rank70]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank70]:     trainer.train()
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank70]:     return inner_training_loop(
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank70]:     self._maybe_log_save_evaluate(
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank70]:     self._save_checkpoint(model, trial)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank70]:     self._save_optimizer_and_scheduler(output_dir)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank70]:     self.model_wrapped.save_checkpoint(output_dir)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank70]:     self._create_zero_checkpoint_files(save_dir, tag)
what(): [PG ID 1 PG GUID 1 Rank 84] Process group watchdog thread terminated with exception: [Rank 84] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7da417d93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7da3cd02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7da3cd031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank70]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank70]:     return func(*args, **kwargs)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank70]:     return cdb.barrier(group=group, async_op=async_op)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank70]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank70]:     return func(*args, **kwargs)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7da3cd03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7da417eee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7da41c694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7da41c726850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7da417d93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7da3ccca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank70]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
frame #2: + 0x145c0 (0x7da417eee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7da41c694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7da41c726850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank70]:     work = group.barrier(opts=opts)
[rank70]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 70.
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 66] Process group watchdog thread terminated with exception: [Rank 66] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a02781b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a022d42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a022d431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a022d43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7a0278c665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7a027ca94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7a027cb26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a02781b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7a022d0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7a0278c665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7a027ca94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7a027cb26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank97]:[E217 15:05:51.563789994 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 97] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank97]:[E217 15:05:51.563814775 ProcessGroupNCCL.cpp:630] [Rank 97] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank97]:[E217 15:05:51.563819745 ProcessGroupNCCL.cpp:636] [Rank 97] To avoid data inconsistency, we are taking the entire process down.
[rank117]:[E217 15:05:51.848712099 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 117] Process group watchdog thread terminated with exception: [Rank 117] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x702eead6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x702ea042a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702ea0431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x702ea043361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x702eebe5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x702eefa94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x702eefb26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 117] Process group watchdog thread terminated with exception: [Rank 117] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x702eead6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x702ea042a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702ea0431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x702ea043361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x702eebe5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x702eefa94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x702eefb26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x702eead6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x702ea00a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x702eebe5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x702eefa94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x702eefb26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank70]:[E217 15:05:51.713929521 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 70] Process group watchdog thread terminated with exception: [Rank 70] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x741a0c393446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7419c162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7419c1631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7419c163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x741a0c4ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x741a10c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x741a10d26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 70] Process group watchdog thread terminated with exception: [Rank 70] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x741a0c393446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7419c162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7419c1631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7419c163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x741a0c4ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x741a10c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x741a10d26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x741a0c393446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7419c12a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x741a0c4ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x741a10c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x741a10d26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank81]:[E217 15:05:51.061476222 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 81] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank81]:[E217 15:05:51.061501213 ProcessGroupNCCL.cpp:630] [Rank 81] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank81]:[E217 15:05:51.061507113 ProcessGroupNCCL.cpp:636] [Rank 81] To avoid data inconsistency, we are taking the entire process down.
[rank85]:[E217 15:05:51.062097877 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 85] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank85]:[E217 15:05:51.062118088 ProcessGroupNCCL.cpp:630] [Rank 85] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank85]:[E217 15:05:51.062122969 ProcessGroupNCCL.cpp:636] [Rank 85] To avoid data inconsistency, we are taking the entire process down.
[rank18]: Traceback (most recent call last):
[rank18]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank18]:     train(attn_implementation="flash_attention_2")
[rank18]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank18]:     trainer.train()
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank18]:     return inner_training_loop(
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank18]:     self._maybe_log_save_evaluate(
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank18]:     self._save_checkpoint(model, trial)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank18]:     self._save_optimizer_and_scheduler(output_dir)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank18]:     self.model_wrapped.save_checkpoint(output_dir)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank18]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank18]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank18]:     return func(*args, **kwargs)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank18]:     return cdb.barrier(group=group, async_op=async_op)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank18]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank18]:     return func(*args, **kwargs)
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank18]:     work = group.barrier(opts=opts)
[rank18]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 18.
[rank83]:[E217 15:05:51.067077669 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 83] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank83]:[E217 15:05:51.067108640 ProcessGroupNCCL.cpp:630] [Rank 83] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank83]:[E217 15:05:51.067113131 ProcessGroupNCCL.cpp:636] [Rank 83] To avoid data inconsistency, we are taking the entire process down.
[rank87]: Traceback (most recent call last):
[rank87]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank87]:     train(attn_implementation="flash_attention_2")
[rank87]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train
[rank87]:     trainer.train()
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank87]:     return inner_training_loop(
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank87]:     self._maybe_log_save_evaluate(
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
[rank87]:     self._save_checkpoint(model, trial)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint
[rank87]:     self._save_optimizer_and_scheduler(output_dir)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler
[rank87]:     self.model_wrapped.save_checkpoint(output_dir)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint
[rank87]:     self._create_zero_checkpoint_files(save_dir, tag)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files
[rank87]:     dist.barrier(group=self.optimizer.dp_process_group)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank87]:     return func(*args, **kwargs)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
[rank87]:     return cdb.barrier(group=group, async_op=async_op)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier
[rank87]:     return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank87]:     return func(*args, **kwargs)
[rank87]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
[rank87]:     work = group.barrier(opts=opts)
[rank87]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 87.
[rank95]:[E217 15:05:51.156200164 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 95] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank95]:[E217 15:05:51.156229316 ProcessGroupNCCL.cpp:630] [Rank 95] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank95]:[E217 15:05:51.156234286 ProcessGroupNCCL.cpp:636] [Rank 95] To avoid data inconsistency, we are taking the entire process down.
[rank93]:[E217 15:05:51.167348658 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 93] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412.
[rank93]:[E217 15:05:51.167371400 ProcessGroupNCCL.cpp:630] [Rank 93] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank93]:[E217 15:05:51.167376090 ProcessGroupNCCL.cpp:636] [Rank 93] To avoid data inconsistency, we are taking the entire process down.
[rank87]:[E217 15:05:51.081113747 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 87] Process group watchdog thread terminated with exception: [Rank 87] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70fddf76c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70fd94e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70fd94e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70fd94e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x70fde045c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x70fde4294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x70fde4326850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 87] Process group watchdog thread terminated with exception: [Rank 87] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70fddf76c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70fd94e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70fd94e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70fd94e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x70fde045c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x70fde4294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x70fde4326850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70fddf76c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x70fd94aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x70fde045c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x70fde4294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x70fde4326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank113]:[E217 15:05:51.895403226 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 113] Process group watchdog thread terminated with exception: [Rank 113] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600099 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x705c3556c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x705beac2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x705beac31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x705beac3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x705c366635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x705c3a294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x705c3a326850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank63]:[E217 15:05:51.634781591 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 63] Process group watchdog thread terminated with exception: [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7517effb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7517a522a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7517a5231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
what(): [PG ID 1 PG GUID 1 Rank 113] Process group watchdog thread terminated with exception: [Rank 113] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600099 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x705c3556c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x705beac2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x705beac31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7517a523361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7517f0a555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7517f4894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7517f4926850 in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x705beac3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x705c366635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x705c3a294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x705c3a326850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x705c3556c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x705bea8a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
terminate called after throwing an instance of 'c10::DistBackendError'
frame #2: + 0x145c0 (0x705c366635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x705c3a294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x705c3a326850 in /lib/x86_64-linux-gnu/libc.so.6)
what(): [PG ID 1 PG GUID 1 Rank 63] Process group watchdog thread terminated with exception: [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7517effb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7517a522a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7517a5231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7517a523361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7517f0a555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7517f4894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7517f4926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7517effb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7517a4ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7517f0a555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7517f4894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7517f4926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank97]: Traceback (most recent call last): [rank97]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank97]: train(attn_implementation="flash_attention_2") [rank97]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank97]: trainer.train() [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank97]: return inner_training_loop( [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank97]: self._maybe_log_save_evaluate( [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank97]: self._save_checkpoint(model, trial) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank97]: self._save_optimizer_and_scheduler(output_dir) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank97]: self.model_wrapped.save_checkpoint(output_dir) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank97]: self._create_zero_checkpoint_files(save_dir, tag) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank97]: dist.barrier(group=self.optimizer.dp_process_group) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank97]: return func(*args, **kwargs) [rank97]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank97]: return cdb.barrier(group=group, async_op=async_op) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank97]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank97]: return func(*args, **kwargs) [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank97]: work = group.barrier(opts=opts) [rank97]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 97. [rank85]: Traceback (most recent call last): [rank85]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank85]: train(attn_implementation="flash_attention_2") [rank85]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank85]: trainer.train() [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank85]: return inner_training_loop( [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank85]: self._maybe_log_save_evaluate( [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank85]: self._save_checkpoint(model, trial) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank85]: self._save_optimizer_and_scheduler(output_dir) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank85]: self.model_wrapped.save_checkpoint(output_dir) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank85]: self._create_zero_checkpoint_files(save_dir, tag) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank85]: dist.barrier(group=self.optimizer.dp_process_group) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank85]: return func(*args, **kwargs) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank85]: return cdb.barrier(group=group, async_op=async_op) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank85]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank85]: return func(*args, **kwargs) [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank85]: work = group.barrier(opts=opts) [rank85]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 85. 
[rank81]: Traceback (most recent call last): [rank81]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank81]: train(attn_implementation="flash_attention_2") [rank81]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank81]: trainer.train() [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank81]: return inner_training_loop( [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank81]: self._maybe_log_save_evaluate( [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank81]: self._save_checkpoint(model, trial) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank81]: self._save_optimizer_and_scheduler(output_dir) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank81]: self.model_wrapped.save_checkpoint(output_dir) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank81]: self._create_zero_checkpoint_files(save_dir, tag) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank81]: dist.barrier(group=self.optimizer.dp_process_group) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank81]: return func(*args, **kwargs) [rank97]:[E217 15:05:51.632442228 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 97] Process group watchdog thread terminated with exception: [Rank 97] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7717c3b6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77177922a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x771779231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank81]: return cdb.barrier(group=group, async_op=async_op) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank81]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank81]: return func(*args, **kwargs) [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank81]: work = group.barrier(opts=opts) [rank81]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 81. frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77177923361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7717c485c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7717c8894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7717c8926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 97] Process group watchdog thread terminated with exception: [Rank 97] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7717c3b6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77177922a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x771779231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77177923361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7717c485c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7717c8894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7717c8926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7717c3b6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x771778ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7717c485c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7717c8894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7717c8926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank95]: Traceback (most recent call last): [rank95]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank95]: train(attn_implementation="flash_attention_2") [rank95]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank95]: trainer.train() [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank95]: return inner_training_loop( [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank95]: self._maybe_log_save_evaluate( [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank95]: self._save_checkpoint(model, trial) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank95]: self._save_optimizer_and_scheduler(output_dir) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank95]: self.model_wrapped.save_checkpoint(output_dir) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank95]: self._create_zero_checkpoint_files(save_dir, tag) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank95]: dist.barrier(group=self.optimizer.dp_process_group) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank95]: return func(*args, **kwargs) [rank95]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank95]: return cdb.barrier(group=group, async_op=async_op) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank95]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank95]: return func(*args, **kwargs) [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank95]: work = group.barrier(opts=opts) [rank95]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 95. [rank83]: Traceback (most recent call last): [rank83]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank83]: train(attn_implementation="flash_attention_2") [rank83]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank83]: trainer.train() [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank83]: return inner_training_loop( [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank83]: self._maybe_log_save_evaluate( [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank83]: self._save_checkpoint(model, trial) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank83]: self._save_optimizer_and_scheduler(output_dir) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank83]: self.model_wrapped.save_checkpoint(output_dir) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank83]: self._create_zero_checkpoint_files(save_dir, tag) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank83]: dist.barrier(group=self.optimizer.dp_process_group) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank83]: return func(*args, **kwargs) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank83]: return cdb.barrier(group=group, async_op=async_op) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank83]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank83]: return func(*args, **kwargs) [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank83]: work = group.barrier(opts=opts) [rank83]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 83. 
[rank93]: Traceback (most recent call last): [rank93]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank93]: train(attn_implementation="flash_attention_2") [rank93]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank93]: trainer.train() [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank93]: return inner_training_loop( [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank93]: self._maybe_log_save_evaluate( [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank93]: self._save_checkpoint(model, trial) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank93]: self._save_optimizer_and_scheduler(output_dir) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank93]: self.model_wrapped.save_checkpoint(output_dir) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank93]: self._create_zero_checkpoint_files(save_dir, tag) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank93]: dist.barrier(group=self.optimizer.dp_process_group) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank93]: return func(*args, **kwargs) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank93]: return cdb.barrier(group=group, async_op=async_op) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank93]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank93]: return func(*args, **kwargs) [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank93]: work = group.barrier(opts=opts) [rank93]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 93. [rank20]:[E217 15:05:51.251401111 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 20] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank20]:[E217 15:05:51.251423612 ProcessGroupNCCL.cpp:630] [Rank 20] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank20]:[E217 15:05:51.251428762 ProcessGroupNCCL.cpp:636] [Rank 20] To avoid data inconsistency, we are taking the entire process down. [rank95]:[E217 15:05:51.216566733 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 95] Process group watchdog thread terminated with exception: [Rank 95] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77840b8af446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7783c0c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7783c0c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7783c0c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77840ba165c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x778410294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x778410326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 95] Process group watchdog thread terminated with exception: [Rank 95] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77840b8af446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7783c0c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7783c0c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7783c0c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77840ba165c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x778410294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x778410326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77840b8af446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7783c08a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x77840ba165c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x778410294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x778410326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank115]:[E217 15:05:51.933336282 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 115] Process group watchdog thread terminated with exception: [Rank 115] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x738869193446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73881e42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73881e431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73881e43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7388692ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73886da94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73886db26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 115] Process group watchdog thread terminated with exception: [Rank 115] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600077 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x738869193446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73881e42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73881e431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73881e43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7388692ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73886da94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73886db26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x738869193446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73881e0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7388692ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73886da94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73886db26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank93]:[E217 15:05:51.222047464 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 93] Process group watchdog thread terminated with exception: [Rank 93] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73eb04b93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73eab9e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73eab9e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73eab9e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73eb04cee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73eb09494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73eb09526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 93] Process group watchdog thread terminated with exception: [Rank 93] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73eb04b93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73eab9e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73eab9e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73eab9e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73eb04cee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73eb09494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73eb09526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73eb04b93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73eab9aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x73eb04cee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73eb09494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73eb09526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank17]:[E217 15:05:51.259894064 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 17] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank17]:[E217 15:05:51.259915275 ProcessGroupNCCL.cpp:630] [Rank 17] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank17]:[E217 15:05:51.259921415 ProcessGroupNCCL.cpp:636] [Rank 17] To avoid data inconsistency, we are taking the entire process down. [rank119]:[E217 15:05:51.935934166 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 119] Process group watchdog thread terminated with exception: [Rank 119] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72b05c3b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72b01162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72b011631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72b01163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72b05ce555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72b060c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72b060d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 119] Process group watchdog thread terminated with exception: [Rank 119] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72b05c3b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72b01162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72b011631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank19]:[E217 15:05:51.260950298 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 19] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank19]:[E217 15:05:51.260971849 ProcessGroupNCCL.cpp:630] [Rank 19] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank19]:[E217 15:05:51.260979019 ProcessGroupNCCL.cpp:636] [Rank 19] To avoid data inconsistency, we are taking the entire process down. 
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72b01163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72b05ce555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72b060c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72b060d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72b05c3b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72b0112a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72b05ce555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72b060c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72b060d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank21]:[E217 15:05:51.263763971 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 21] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank21]:[E217 15:05:51.263790113 ProcessGroupNCCL.cpp:630] [Rank 21] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank21]:[E217 15:05:51.263796873 ProcessGroupNCCL.cpp:636] [Rank 21] To avoid data inconsistency, we are taking the entire process down. [rank18]:[E217 15:05:51.265142142 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 18] Process group watchdog thread terminated with exception: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e26ca96c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e268002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e2680031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e268003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e26cade25c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e26cf694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e26cf726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 18] Process group watchdog thread terminated with exception: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e26ca96c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e268002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e2680031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e268003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e26cade25c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e26cf694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e26cf726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e26ca96c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e267fca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7e26cade25c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e26cf694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e26cf726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank23]:[E217 15:05:51.265801765 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 23] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank23]:[E217 15:05:51.265820006 ProcessGroupNCCL.cpp:630] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank23]:[E217 15:05:51.265824246 ProcessGroupNCCL.cpp:636] [Rank 23] To avoid data inconsistency, we are taking the entire process down. [rank58]:[E217 15:05:51.713300281 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 58] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank58]:[E217 15:05:51.713322431 ProcessGroupNCCL.cpp:630] [Rank 58] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank58]:[E217 15:05:51.713327272 ProcessGroupNCCL.cpp:636] [Rank 58] To avoid data inconsistency, we are taking the entire process down. [rank20]:[E217 15:05:51.299848702 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 20] Process group watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x762bb356c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x762b68c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x762b68c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x762b68c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x762bb39635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x762bb8094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x762bb8126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 20] Process group watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600075 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x762bb356c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x762b68c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x762b68c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x762b68c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x762bb39635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x762bb8094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x762bb8126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x762bb356c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x762b688a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x762bb39635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x762bb8094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x762bb8126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank60]:[E217 15:05:51.715158197 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 60] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank60]:[E217 15:05:51.715181778 ProcessGroupNCCL.cpp:630] [Rank 60] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank60]:[E217 15:05:51.715186788 ProcessGroupNCCL.cpp:636] [Rank 60] To avoid data inconsistency, we are taking the entire process down. [rank62]:[E217 15:05:51.715316875 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 62] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank62]:[E217 15:05:51.715337196 ProcessGroupNCCL.cpp:630] [Rank 62] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank62]:[E217 15:05:51.715342056 ProcessGroupNCCL.cpp:636] [Rank 62] To avoid data inconsistency, we are taking the entire process down. [rank83]:[E217 15:05:51.177641496 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 83] Process group watchdog thread terminated with exception: [Rank 83] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70b3a5bb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70b35ae2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70b35ae31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70b35ae3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70b3a666c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x70b3aa494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x70b3aa526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank19]:[E217 15:05:51.301220452 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 19] Process group watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x779380ea3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77933622a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x779336231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77933623361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77938185c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x779385894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x779385926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 83] Process group watchdog thread terminated with exception: [Rank 83] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600047 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70b3a5bb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70b35ae2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70b35ae31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70b35ae3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70b3a666c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x70b3aa494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x70b3aa526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70b3a5bb9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x70b35aaa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x70b3a666c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x70b3aa494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x70b3aa526850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 19] Process group watchdog thread terminated with exception: [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600080 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x779380ea3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77933622a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x779336231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77933623361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77938185c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x779385894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x779385926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x779380ea3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x779335ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x77938185c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x779385894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x779385926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank17]: Traceback (most recent call last): [rank17]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank17]: train(attn_implementation="flash_attention_2") [rank17]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank17]: trainer.train() [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank17]: return inner_training_loop( [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank17]: self._maybe_log_save_evaluate( [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank17]: self._save_checkpoint(model, trial) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank17]: self._save_optimizer_and_scheduler(output_dir) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank17]: self.model_wrapped.save_checkpoint(output_dir) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank17]: self._create_zero_checkpoint_files(save_dir, tag) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank17]: dist.barrier(group=self.optimizer.dp_process_group) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank17]: return func(*args, **kwargs) [rank17]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank17]: return cdb.barrier(group=group, async_op=async_op) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank17]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank17]: return func(*args, **kwargs) [rank17]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank17]: work = group.barrier(opts=opts) [rank17]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 17. [rank23]:[E217 15:05:51.311368580 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 23] Process group watchdog thread terminated with exception: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e86d0f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e868662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e8686631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e868663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e86d1c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e86d5c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e86d5d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 23] Process group watchdog thread terminated with exception: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e86d0f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e868662a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e8686631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e868663361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e86d1c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e86d5c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e86d5d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e86d0f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e86862a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7e86d1c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e86d5c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e86d5d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank67]:[E217 15:05:51.852244073 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 67] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank67]:[E217 15:05:51.852270834 ProcessGroupNCCL.cpp:630] [Rank 67] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank67]:[E217 15:05:51.852276954 ProcessGroupNCCL.cpp:636] [Rank 67] To avoid data inconsistency, we are taking the entire process down. [rank71]:[E217 15:05:51.852949854 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 71] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank71]:[E217 15:05:51.852967564 ProcessGroupNCCL.cpp:630] [Rank 71] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank71]:[E217 15:05:51.852973035 ProcessGroupNCCL.cpp:636] [Rank 71] To avoid data inconsistency, we are taking the entire process down. [rank65]:[E217 15:05:51.853056416 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 65] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank65]:[E217 15:05:51.853084626 ProcessGroupNCCL.cpp:630] [Rank 65] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank65]:[E217 15:05:51.853091066 ProcessGroupNCCL.cpp:636] [Rank 65] To avoid data inconsistency, we are taking the entire process down. 
[rank21]: Traceback (most recent call last): [rank21]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank21]: train(attn_implementation="flash_attention_2") [rank21]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank21]: trainer.train() [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank21]: return inner_training_loop( [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank21]: self._maybe_log_save_evaluate( [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank21]: self._save_checkpoint(model, trial) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank21]: self._save_optimizer_and_scheduler(output_dir) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank21]: self.model_wrapped.save_checkpoint(output_dir) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank21]: self._create_zero_checkpoint_files(save_dir, tag) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank21]: dist.barrier(group=self.optimizer.dp_process_group) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank21]: return func(*args, **kwargs) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank21]: return cdb.barrier(group=group, async_op=async_op) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank21]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank21]: return func(*args, **kwargs) [rank21]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank21]: work = group.barrier(opts=opts) [rank21]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 21. [rank69]:[E217 15:05:51.853675765 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 69] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank69]:[E217 15:05:51.853690435 ProcessGroupNCCL.cpp:630] [Rank 69] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank69]:[E217 15:05:51.853694575 ProcessGroupNCCL.cpp:636] [Rank 69] To avoid data inconsistency, we are taking the entire process down. [rank17]:[E217 15:05:51.321478525 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 17] Process group watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f4b596c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78f46b02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78f46b031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78f46b03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78f4b665c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78f4ba694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78f4ba726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 17] Process group watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f4b596c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78f46b02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78f46b031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78f46b03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78f4b665c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78f4ba694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78f4ba726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78f4b596c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x78f46aca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x78f4b665c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x78f4ba694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x78f4ba726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank81]:[E217 15:05:51.200441290 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 81] Process group watchdog thread terminated with exception: [Rank 81] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a01fcea3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a01b222a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a01b2231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a01b223361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a01fd85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a0201894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a0201926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 81] Process group watchdog thread terminated with exception: [Rank 81] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a01fcea3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a01b222a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a01b2231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a01b223361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a01fd85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a0201894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a0201926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a01fcea3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a01b1ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7a01fd85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a0201894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a0201926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank21]:[E217 15:05:51.326847209 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 21] Process group watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c788209f446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c783742a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c7837431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c783743361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c78821fa5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c7886a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c7886b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 21] Process group watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c788209f446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c783742a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c7837431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c783743361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c78821fa5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c7886a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c7886b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c788209f446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7c78370a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7c78821fa5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7c7886a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7c7886b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank99]:[E217 15:05:51.724580182 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 99] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank99]:[E217 15:05:51.724604502 ProcessGroupNCCL.cpp:630] [Rank 99] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank99]:[E217 15:05:51.724609632 ProcessGroupNCCL.cpp:636] [Rank 99] To avoid data inconsistency, we are taking the entire process down. [rank101]:[E217 15:05:51.728156270 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 101] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank101]:[E217 15:05:51.728177480 ProcessGroupNCCL.cpp:630] [Rank 101] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank101]:[E217 15:05:51.728183931 ProcessGroupNCCL.cpp:636] [Rank 101] To avoid data inconsistency, we are taking the entire process down. [rank103]:[E217 15:05:51.729726070 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 103] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank103]:[E217 15:05:51.729745681 ProcessGroupNCCL.cpp:630] [Rank 103] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank103]:[E217 15:05:51.729750601 ProcessGroupNCCL.cpp:636] [Rank 103] To avoid data inconsistency, we are taking the entire process down. [rank92]:[E217 15:05:51.308973766 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 92] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank92]:[E217 15:05:51.308995117 ProcessGroupNCCL.cpp:630] [Rank 92] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank92]:[E217 15:05:51.309022689 ProcessGroupNCCL.cpp:636] [Rank 92] To avoid data inconsistency, we are taking the entire process down. [rank94]:[E217 15:05:51.309902645 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 94] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank94]:[E217 15:05:51.309924967 ProcessGroupNCCL.cpp:630] [Rank 94] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank94]:[E217 15:05:51.309929487 ProcessGroupNCCL.cpp:636] [Rank 94] To avoid data inconsistency, we are taking the entire process down. [rank90]:[E217 15:05:51.311131664 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 90] Timeout at NCCL work: 66413, last enqueued NCCL work: 66413, last completed NCCL work: 66412. [rank90]:[E217 15:05:51.311150415 ProcessGroupNCCL.cpp:630] [Rank 90] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank90]:[E217 15:05:51.311154725 ProcessGroupNCCL.cpp:636] [Rank 90] To avoid data inconsistency, we are taking the entire process down. 
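Although the call that hangs is dist.barrier, every watchdog record reports the timed-out work item as WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1). That is consistent with how the NCCL backend implements a barrier: it enqueues an all-reduce over a single dummy element and then waits on it, so a barrier that never completes surfaces as a one-element ALLREDUCE timeout. The identical SeqNum=66413 on every rank also shows they all agree on which collective is pending; the failure is that at least one rank never joined it. A rough illustration of the equivalence (not the library's actual code):

# Rough illustration only: on the NCCL backend a barrier behaves roughly like a
# one-element all-reduce followed by a synchronization, which is why the watchdog
# reports OpType=ALLREDUCE with NumelIn=1/NumelOut=1 for a dist.barrier call.
import torch
import torch.distributed as dist

def barrier_like(device: torch.device) -> None:
    dummy = torch.zeros(1, device=device)  # NumelIn=1, NumelOut=1
    dist.all_reduce(dummy)                 # the collective the watchdog timed out on
    torch.cuda.synchronize(device)         # block the host until it finishes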
[rank58]: Traceback (most recent call last): [rank58]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank58]: train(attn_implementation="flash_attention_2") [rank58]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank58]: trainer.train() [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank58]: return inner_training_loop( [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank58]: self._maybe_log_save_evaluate( [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank58]: self._save_checkpoint(model, trial) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank58]: self._save_optimizer_and_scheduler(output_dir) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank58]: self.model_wrapped.save_checkpoint(output_dir) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank58]: self._create_zero_checkpoint_files(save_dir, tag) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank58]: dist.barrier(group=self.optimizer.dp_process_group) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank58]: return func(*args, **kwargs) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank58]: return cdb.barrier(group=group, async_op=async_op) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank58]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank58]: return func(*args, **kwargs) [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank58]: work = group.barrier(opts=opts) [rank58]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 58. [rank67]:[E217 15:05:51.893238699 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 67] Process group watchdog thread terminated with exception: [Rank 67] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e204f2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e200462a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e2004631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e200463361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e204fe525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e2053c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e2053d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [rank69]:[E217 15:05:51.893745967 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 69] Process group watchdog thread terminated with exception: [Rank 69] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c9596c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f1c4b02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f1c4b031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f1c4b03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f1c95de45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f1c9a694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f1c9a726850 in /lib/x86_64-linux-gnu/libc.so.6) [PG ID 1 PG GUID 1 Rank 67] Process group watchdog thread terminated with exception: [Rank 67] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600081 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e204f2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e200462a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e2004631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e200463361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e204fe525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e2053c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e2053d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e204f2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e20042a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7e204fe525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e2053c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e2053d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 69] Process group watchdog thread terminated with exception: [Rank 69] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c9596c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f1c4b02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f1c4b031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f1c4b03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f1c95de45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f1c9a694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f1c9a726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f1c9596c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f1c4aca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f1c95de45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f1c9a694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f1c9a726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank85]:[E217 15:05:51.232897044 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 85] Process group watchdog thread terminated with exception: [Rank 85] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a869396c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a864902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a8649031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a864903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a8694a595c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a8698494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a8698526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 85] Process group watchdog thread terminated with exception: [Rank 85] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a869396c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a864902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a8649031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a864903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a8694a595c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a8698494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a8698526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a869396c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a8648ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank60]: Traceback (most recent call last): [rank60]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank60]: train(attn_implementation="flash_attention_2") [rank60]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank60]: trainer.train() [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank60]: return inner_training_loop( [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank60]: self._maybe_log_save_evaluate( [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank60]: self._save_checkpoint(model, trial) frame #2: + 0x145c0 (0x7a8694a595c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a8698494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a8698526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank60]: self._save_optimizer_and_scheduler(output_dir) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank60]: self.model_wrapped.save_checkpoint(output_dir) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank60]: self._create_zero_checkpoint_files(save_dir, tag) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank60]: dist.barrier(group=self.optimizer.dp_process_group) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank60]: return func(*args, **kwargs) [rank60]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank60]: return cdb.barrier(group=group, async_op=async_op) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank60]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank60]: return func(*args, **kwargs) [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank60]: work = group.barrier(opts=opts) [rank60]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 60. [rank62]: Traceback (most recent call last): [rank62]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank62]: train(attn_implementation="flash_attention_2") [rank62]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank62]: trainer.train() [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank62]: return inner_training_loop( [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank62]: self._maybe_log_save_evaluate( [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank62]: self._save_checkpoint(model, trial) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank62]: self._save_optimizer_and_scheduler(output_dir) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank62]: self.model_wrapped.save_checkpoint(output_dir) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank62]: self._create_zero_checkpoint_files(save_dir, tag) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank62]: dist.barrier(group=self.optimizer.dp_process_group) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank62]: return func(*args, **kwargs) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank62]: return cdb.barrier(group=group, async_op=async_op) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank62]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank62]: return func(*args, **kwargs) [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank62]: work = group.barrier(opts=opts) [rank62]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 62. 
[rank65]: Traceback (most recent call last): [rank65]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank65]: train(attn_implementation="flash_attention_2") [rank65]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank65]: trainer.train() [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank65]: return inner_training_loop( [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank65]: self._maybe_log_save_evaluate( [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank65]: self._save_checkpoint(model, trial) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank65]: self._save_optimizer_and_scheduler(output_dir) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank65]: self.model_wrapped.save_checkpoint(output_dir) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank65]: self._create_zero_checkpoint_files(save_dir, tag) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank65]: dist.barrier(group=self.optimizer.dp_process_group) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank65]: return func(*args, **kwargs) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank65]: return cdb.barrier(group=group, async_op=async_op) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank65]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank65]: return func(*args, **kwargs) [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank65]: work = group.barrier(opts=opts) [rank65]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 65. [rank58]:[E217 15:05:51.779930214 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 58] Process group watchdog thread terminated with exception: [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78504136c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x784ff6a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x784ff6a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x784ff6a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78504205c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x785045e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x785045f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 58] Process group watchdog thread terminated with exception: [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78504136c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x784ff6a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x784ff6a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x784ff6a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78504205c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x785045e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x785045f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78504136c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x784ff66a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x78504205c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x785045e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x785045f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank62]:[E217 15:05:51.782397182 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 62] Process group watchdog thread terminated with exception: [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79e32eb6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79e2e422a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x79e2e4231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79e2e423361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x79e32f85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x79e333694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x79e333726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 62] Process group watchdog thread terminated with exception: [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79e32eb6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79e2e422a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x79e2e4231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79e2e423361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x79e32f85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x79e333694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x79e333726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79e32eb6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x79e2e3ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x79e32f85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x79e333694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x79e333726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank71]: Traceback (most recent call last): [rank71]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank71]: train(attn_implementation="flash_attention_2") [rank71]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank71]: trainer.train() [rank71]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank71]: return inner_training_loop( [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank71]: self._maybe_log_save_evaluate( [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank71]: self._save_checkpoint(model, trial) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank71]: self._save_optimizer_and_scheduler(output_dir) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank71]: self.model_wrapped.save_checkpoint(output_dir) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank71]: self._create_zero_checkpoint_files(save_dir, tag) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank71]: dist.barrier(group=self.optimizer.dp_process_group) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank71]: return func(*args, **kwargs) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank71]: return cdb.barrier(group=group, async_op=async_op) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank71]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank71]: return func(*args, **kwargs) [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank71]: work = group.barrier(opts=opts) [rank71]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 71. [rank101]:[E217 15:05:51.767115106 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 101] Process group watchdog thread terminated with exception: [Rank 101] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76850716c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7684bc82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7684bc831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7684bc83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7685075635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x76850bc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x76850bd26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 101] Process group watchdog thread terminated with exception: [Rank 101] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76850716c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7684bc82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7684bc831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7684bc83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7685075635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x76850bc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x76850bd26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76850716c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7684bc4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7685075635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x76850bc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x76850bd26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank71]:[E217 15:05:51.919296297 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 71] Process group watchdog thread terminated with exception: [Rank 71] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72f5b236c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72f567a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72f567a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72f567a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72f5b27635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72f5b6e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72f5b6f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank99]: Traceback (most recent call last): [rank99]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank99]: train(attn_implementation="flash_attention_2") [rank99]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank99]: trainer.train() [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank99]: return inner_training_loop( [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank99]: self._maybe_log_save_evaluate( [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank99]: self._save_checkpoint(model, trial) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank99]: self._save_optimizer_and_scheduler(output_dir) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank99]: self.model_wrapped.save_checkpoint(output_dir) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank99]: self._create_zero_checkpoint_files(save_dir, tag) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank99]: dist.barrier(group=self.optimizer.dp_process_group) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank99]: return func(*args, **kwargs) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank99]: return cdb.barrier(group=group, async_op=async_op) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank99]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank99]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank99]: return func(*args, **kwargs) [rank99]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank99]: work = group.barrier(opts=opts) [rank99]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 99. what(): [PG ID 1 PG GUID 1 Rank 71] Process group watchdog thread terminated with exception: [Rank 71] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600087 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72f5b236c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72f567a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72f567a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72f567a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72f5b27635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72f5b6e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72f5b6f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72f5b236c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72f5676a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72f5b27635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72f5b6e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72f5b6f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank92]: Traceback (most recent call last): [rank92]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank92]: train(attn_implementation="flash_attention_2") [rank92]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank92]: trainer.train() [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank92]: return inner_training_loop( [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank92]: self._maybe_log_save_evaluate( [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank92]: self._save_checkpoint(model, trial) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank92]: self._save_optimizer_and_scheduler(output_dir) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank92]: self.model_wrapped.save_checkpoint(output_dir) [rank92]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank92]: self._create_zero_checkpoint_files(save_dir, tag) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank92]: dist.barrier(group=self.optimizer.dp_process_group) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank92]: return func(*args, **kwargs) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank92]: return cdb.barrier(group=group, async_op=async_op) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank92]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank92]: return func(*args, **kwargs) [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank92]: work = group.barrier(opts=opts) [rank92]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 92. [rank103]: Traceback (most recent call last): [rank103]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank103]: train(attn_implementation="flash_attention_2") [rank103]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank103]: trainer.train() [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank103]: return inner_training_loop( [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank103]: self._maybe_log_save_evaluate( [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank103]: self._save_checkpoint(model, trial) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank103]: self._save_optimizer_and_scheduler(output_dir) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank103]: self.model_wrapped.save_checkpoint(output_dir) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank103]: self._create_zero_checkpoint_files(save_dir, tag) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank103]: dist.barrier(group=self.optimizer.dp_process_group) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank103]: return func(*args, **kwargs) [rank92]:[E217 15:05:51.355687720 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 92] Process group watchdog thread terminated with exception: [Rank 92] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74acb50e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74ac6a42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74ac6a431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank103]: return cdb.barrier(group=group, async_op=async_op) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank103]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank103]: return func(*args, **kwargs) [rank103]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank103]: work = group.barrier(opts=opts) [rank103]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 103. frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74ac6a43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x74acb58735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x74acb9a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x74acb9b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 92] Process group watchdog thread terminated with exception: [Rank 92] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74acb50e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74ac6a42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74ac6a431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74ac6a43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x74acb58735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x74acb9a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x74acb9b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74acb50e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x74ac6a0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x74acb58735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x74acb9a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x74acb9b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank65]:[E217 15:05:51.931992981 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 65] Process group watchdog thread terminated with exception: [Rank 65] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72c57176c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72c526e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72c526e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72c526e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72c57245c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72c576294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72c576326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 65] Process group watchdog thread terminated with exception: [Rank 65] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72c57176c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72c526e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72c526e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72c526e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72c57245c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72c576294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72c576326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72c57176c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72c526aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72c57245c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72c576294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72c576326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank94]: Traceback (most recent call last): [rank94]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank94]: train(attn_implementation="flash_attention_2") [rank94]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank94]: trainer.train() [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank94]: return inner_training_loop( [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank94]: self._maybe_log_save_evaluate( [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank94]: self._save_checkpoint(model, trial) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank94]: self._save_optimizer_and_scheduler(output_dir) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank94]: self.model_wrapped.save_checkpoint(output_dir) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank94]: self._create_zero_checkpoint_files(save_dir, tag) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank94]: dist.barrier(group=self.optimizer.dp_process_group) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank94]: return func(*args, **kwargs) [rank94]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank94]: return cdb.barrier(group=group, async_op=async_op) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank94]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank94]: return func(*args, **kwargs) [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank94]: work = group.barrier(opts=opts) [rank94]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 94. [rank99]:[E217 15:05:51.791683256 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 99] Process group watchdog thread terminated with exception: [Rank 99] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76f6b6376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x76f66b62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76f66b631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x76f66b63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x76f6b6e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x76f6bac94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x76f6bad26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 99] Process group watchdog thread terminated with exception: [Rank 99] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76f6b6376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x76f66b62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76f66b631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x76f66b63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x76f6b6e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x76f6bac94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x76f6bad26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76f6b6376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x76f66b2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x76f6b6e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x76f6bac94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x76f6bad26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank90]: Traceback (most recent call last): [rank90]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank90]: train(attn_implementation="flash_attention_2") [rank90]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1633, in train [rank90]: trainer.train() [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train [rank90]: return inner_training_loop( [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop [rank90]: self._maybe_log_save_evaluate( [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate [rank90]: self._save_checkpoint(model, trial) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3194, in _save_checkpoint [rank90]: self._save_optimizer_and_scheduler(output_dir) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3305, in _save_optimizer_and_scheduler [rank90]: self.model_wrapped.save_checkpoint(output_dir) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3205, in save_checkpoint [rank90]: self._create_zero_checkpoint_files(save_dir, tag) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3376, in _create_zero_checkpoint_files [rank90]: dist.barrier(group=self.optimizer.dp_process_group) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper [rank90]: return func(*args, **kwargs) [rank90]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier [rank90]: return cdb.barrier(group=group, async_op=async_op) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 336, in barrier [rank90]: return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper [rank90]: return func(*args, **kwargs) [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier [rank90]: work = group.barrier(opts=opts) [rank90]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 90. [rank103]:[E217 15:05:51.801994813 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 103] Process group watchdog thread terminated with exception: [Rank 103] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70205056c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x702005c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702005c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x702005c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70205125c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x702055294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x702055326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 103] Process group watchdog thread terminated with exception: [Rank 103] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70205056c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x702005c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702005c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x702005c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70205125c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x702055294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x702055326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70205056c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7020058a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x70205125c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x702055294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x702055326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank94]:[E217 15:05:51.373932509 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 94] Process group watchdog thread terminated with exception: [Rank 94] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a3b93d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a3b4942a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a3b49431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a3b4943361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a3b94e5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a3b98a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a3b98b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 94] Process group watchdog thread terminated with exception: [Rank 94] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600018 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a3b93d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a3b4942a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a3b49431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a3b4943361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a3b94e5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a3b98a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a3b98b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a3b93d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a3b490a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7a3b94e5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a3b98a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a3b98b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank90]:[E217 15:05:51.419199340 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 90] Process group watchdog thread terminated with exception: [Rank 90] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7997f996c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7997af02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7997af031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7997af03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7997fa65c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7997fe694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7997fe726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 90] Process group watchdog thread terminated with exception: [Rank 90] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7997f996c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7997af02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7997af031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7997af03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7997fa65c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7997fe694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7997fe726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7997f996c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7997aeca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7997fa65c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7997fe694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7997fe726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank60]:[E217 15:05:51.883075470 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 60] Process group watchdog thread terminated with exception: [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x751c4256c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x751bf7c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x751bf7c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x751bf7c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x751c429e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x751c47294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x751c47326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 60] Process group watchdog thread terminated with exception: [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66413, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
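All of the per-rank records above report the same SeqNum=66413, so at least one rank never entered that barrier; everything that follows (SIGTERM, escalation to SIGKILL, and the SIGABRT summary at the end) is just the elastic agents tearing the job down after the watchdogs aborted their communicators. For a rerun it can help to turn on NCCL's and torch.distributed's own logging so the straggling rank identifies itself: NCCL_DEBUG, NCCL_DEBUG_SUBSYS and TORCH_DISTRIBUTED_DEBUG are standard knobs, while the TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC / TORCH_NCCL_ENABLE_MONITORING variables named in the rank-0 warning further down only change how the stuck watchdog itself is handled. These are normally exported in the sbatch/torchrun wrapper before launch; setting them at the top of the entry point, before anything touches torch.distributed, is shown here purely as an illustration:

    # Illustration only: these must be in every rank's environment before the
    # process group is created (normally exported in the launch script).
    import os

    os.environ.setdefault("NCCL_DEBUG", "INFO")          # per-rank NCCL logs
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", "COLL")   # focus on collective calls
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    # Named in the rank-0 watchdog warning below; raising it past the reported
    # 480 s only matters if the watchdog itself is being declared stuck.
    os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "1800")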
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x751c4256c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x751bf7c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x751bf7c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x751bf7c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x751c429e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x751c47294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x751c47326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x751c4256c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x751bf78a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x751c429e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x751c47294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x751c47326850 in /lib/x86_64-linux-gnu/libc.so.6) W0217 15:06:22.676000 237666 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 237822 closing signal SIGTERM W0217 15:06:22.686000 2598810 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2599571 closing signal SIGTERM W0217 15:06:22.689000 2578773 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2580234 closing signal SIGTERM W0217 15:06:22.694000 2629720 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2630496 closing signal SIGTERM W0217 15:06:22.699000 2598122 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2598255 closing signal SIGTERM W0217 15:06:22.705000 1019786 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1020157 closing signal SIGTERM W0217 15:06:22.712000 2614682 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2614815 closing signal SIGTERM W0217 15:06:22.713000 2608625 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2610225 closing signal SIGTERM W0217 15:06:22.726000 233265 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 234570 closing signal SIGTERM W0217 15:06:22.730000 2598250 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2598933 closing signal SIGTERM W0217 15:06:22.731000 2570963 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2572601 closing signal SIGTERM W0217 
15:06:22.731000 262236 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 262370 closing signal SIGTERM
W0217 15:06:22.733000 262147 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 262272 closing signal SIGTERM
W0217 15:06:22.734000 2621360 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2622122 closing signal SIGTERM
W0217 15:06:22.768000 4078901 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4079622 closing signal SIGTERM
[rank0]:[E217 15:06:27.739914174 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
W0217 15:06:52.682000 237666 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 237822 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.690000 2598810 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2599571 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.693000 2578773 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2580234 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.699000 2629720 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2630496 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.703000 2598122 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2598255 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.709000 1019786 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 1020157 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.717000 2614682 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2614815 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.717000 2608625 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2610225 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.730000 233265 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 234570 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.734000 2598250 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2598933 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.735000 2570963 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2572601 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.736000 262236 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 262370 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.737000 262147 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 262272 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.738000 2621360 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2622122 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0217 15:06:52.772000 4078901 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 4079622 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0217 15:07:18.101000 2608625 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 2610226) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
llava/train/train_mem.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 122 (local_rank: 2)
  exitcode  : -6 (pid: 2610227)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610227
[2]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 123 (local_rank: 3)
  exitcode  : -6 (pid: 2610228)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610228
[3]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 124 (local_rank: 4)
  exitcode  : -6 (pid: 2610229)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610229
[4]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 125 (local_rank: 5)
  exitcode  : -6 (pid: 2610230)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610230
[5]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 126 (local_rank: 6)
  exitcode  : -6 (pid: 2610231)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610231
[6]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 127 (local_rank: 7)
  exitcode  : -6 (pid: 2610232)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610232
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-17_15:06:22
  host      : h100-st-p548xlarge-421.ar-ai-use2.hpcaas
  rank      : 121 (local_rank: 1)
  exitcode  : -6 (pid: 2610226)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2610226
========================================================
srun: error: h100-st-p548xlarge-421: task 15: Exited with exit code 1
srun: Terminating StepId=335933.0
slurmstepd: error: *** STEP 335933.0 ON h100-st-p548xlarge-10 CANCELLED AT 2025-02-17T15:07:19 ***
W0217 15:07:19.214000 262147 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 233265 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 1019786 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2598122 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2621360 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2598250 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 262236 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2614682 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2570963 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 4078901 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 237666 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2578773 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2598810 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.214000 2629720 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0217 15:07:19.220000 262147 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 262272 closing signal SIGTERM
W0217 15:07:19.220000 233265 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 234570 closing
signal SIGTERM W0217 15:07:19.221000 2598122 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2598255 closing signal SIGTERM W0217 15:07:19.220000 2598250 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2598933 closing signal SIGTERM W0217 15:07:19.221000 2621360 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2622122 closing signal SIGTERM W0217 15:07:19.220000 2614682 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2614815 closing signal SIGTERM W0217 15:07:19.221000 2570963 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2572601 closing signal SIGTERM W0217 15:07:19.214000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0217 15:07:19.220000 1019786 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1020157 closing signal SIGTERM W0217 15:07:19.220000 237666 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 237822 closing signal SIGTERM W0217 15:07:19.220000 4078901 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4079622 closing signal SIGTERM W0217 15:07:19.221000 2578773 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2580234 closing signal SIGTERM W0217 15:07:19.221000 262236 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 262370 closing signal SIGTERM W0217 15:07:19.221000 2598810 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2599571 closing signal SIGTERM W0217 15:07:19.221000 2629720 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2630496 closing signal SIGTERM W0217 15:07:19.221000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533945 closing signal SIGTERM W0217 15:07:19.222000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533946 closing signal SIGTERM W0217 15:07:19.223000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533947 closing signal SIGTERM W0217 15:07:19.223000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533948 closing signal SIGTERM W0217 15:07:19.225000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533949 closing signal SIGTERM W0217 15:07:19.226000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533950 closing signal SIGTERM W0217 15:07:19.227000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533951 closing signal SIGTERM W0217 15:07:19.228000 1533179 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1533952 closing signal SIGTERM Traceback (most recent 
call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run
    run_result = self._monitor_workers(self._worker_group)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers
    result = self._pcontext.wait(0)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait
    return self._poll()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll
    self.close()  # terminate all running procs
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 923, in _close
    handler.proc.wait()
  File "/usr/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1959, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.10/subprocess.py", line 1917, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2578773 got signal: 15

[Each of the remaining torchrun agents raised the same traceback; their final exception lines and the corresponding srun reports were:]
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2598810 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4078901 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2598122 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 237666 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2614682 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2570963 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2629720 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1019786 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 262147 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2598250 got signal: 15
srun: error: h100-st-p548xlarge-414: task 11: Exited with exit code 1
srun: error: h100-st-p548xlarge-420: task 14: Exited with exit code 1
srun: error: h100-st-p548xlarge-40: task 2: Exited with exit code 1
srun: error: h100-st-p548xlarge-409: task 7: Exited with exit code 1
torch.distributed.elastic.multiprocessing.api.SignalException: Process 262236 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 233265 got signal: 15
srun: error: h100-st-p548xlarge-43: task 3: Exited with exit code 1
srun: error: h100-st-p548xlarge-412: task 10: Exited with exit code 1
srun: error: h100-st-p548xlarge-411: task 9: Exited with exit code 1
srun: error: h100-st-p548xlarge-419: task 13: Exited with exit code 1
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2621360 got signal: 15
srun: error: h100-st-p548xlarge-185: task 6: Exited with exit code 1
srun: error: h100-st-p548xlarge-46: task 4: Exited with exit code 1
srun: error: h100-st-p548xlarge-415: task 12: Exited with exit code 1
srun: error: h100-st-p548xlarge-39: task 1: Exited with exit code 1
srun: error: h100-st-p548xlarge-184: task 5: Exited with exit code 1
srun: error: h100-st-p548xlarge-410: task 8: Exited with exit code 1

[One agent was interrupted while sleeping in its monitor loop rather than while waiting on a worker process:]
Traceback (most recent call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1533179 got signal: 15
srun: error: h100-st-p548xlarge-10: task 0: Exited with exit code 1