W0519 02:12:52.263000 743 torch/distributed/run.py:793]
W0519 02:12:52.263000 743 torch/distributed/run.py:793] *****************************************
W0519 02:12:52.263000 743 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0519 02:12:52.263000 743 torch/distributed/run.py:793] *****************************************
[2025-05-19 02:12:54,462] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:12:54,464] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:12:54,464] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:12:54,465] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747620778.093178 748 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.093175 747 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.093179 749 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.093178 746 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.099817 747 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747620778.099827 746 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747620778.099831 749 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747620778.099834 748 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
Replace train sampler!!
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
[2025-05-19 02:13:05,693] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:13:05,693] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file chat_template.jinja
[2025-05-19 02:13:05,903] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:13:05,906] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:13:05,907] [INFO] [comm.py:652:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2300] 2025-05-19 02:13:06,105 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:693] 2025-05-19 02:13:06,113 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/config.json [INFO|configuration_utils.py:762] 2025-05-19 02:13:06,114 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "5CD-AI/Vintern-1B-v3_5--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 4, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "Hermes-2", "torch_dtype": "float32", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, 
"forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } [INFO|modeling_utils.py:3950] 2025-05-19 02:13:06,115 >> loading weights file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/model.safetensors [INFO|modeling_utils.py:1641] 2025-05-19 02:13:06,165 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-05-19 02:13:06,167 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:1140] 2025-05-19 02:13:06,264 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "use_cache": false } [INFO|modeling_utils.py:4849] 2025-05-19 02:13:08,111 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4857] 2025-05-19 02:13:08,111 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:1093] 2025-05-19 02:13:08,116 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/generation_config.json [INFO|configuration_utils.py:1140] 2025-05-19 02:13:08,116 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,247 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,605 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,662 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). 
Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,695 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
[INFO|trainer.py:734] 2025-05-19 02:13:24,433 >> Using auto half precision backend
[WARNING|trainer.py:796] 2025-05-19 02:13:24,656 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,656 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,660 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,660 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,680 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,680 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,821 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,821 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[2025-05-19 02:13:24,847] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2025-05-19 02:13:24,847] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-19 02:13:26,021] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py311_cu124/fused_adam...
Creating extension directory /root/.cache/torch_extensions/py311_cu124/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
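The fused_adam extension is JIT-compiled by DeepSpeed on first use, which is what the ninja build below and the roughly 34-second "Time to load fused_adam op" further down correspond to; the build is then cached under the torch extensions root shown above. A minimal sketch of how one might warm that cache before launching training (the builder class is from DeepSpeed's public op_builder API; using it here, or prebuilding at install time with DS_BUILD_FUSED_ADAM=1, is an assumption on my part, not something the original run does):

```python
# Sketch: pre-build DeepSpeed's fused_adam extension so a later torchrun
# launch can skip the one-off JIT compile (illustrative, not part of the
# original training script).
from deepspeed.ops.op_builder import FusedAdamBuilder

# Compiles the CUDA sources if no cached build exists and loads the module;
# the artifact is cached under ~/.cache/torch_extensions (or $TORCH_EXTENSIONS_DIR).
fused_adam_module = FusedAdamBuilder().load()
print(fused_adam_module.__name__)
```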
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.11/dist-packages/torch/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.11/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o [2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.11/dist-packages/torch/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.11/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o [3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.11/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so Loading extension module fused_adam... Time to load fused_adam op: 34.320404291152954 seconds [2025-05-19 02:14:00,350] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-05-19 02:14:00,350] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer Loading extension module fused_adam... 
Time to load fused_adam op: 34.33998680114746 seconds [2025-05-19 02:14:00,382] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-05-19 02:14:00,382] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-05-19 02:14:00,382] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-05-19 02:14:00,382] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-05-19 02:14:00,382] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-05-19 02:14:00,383] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-05-19 02:14:00,383] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 34.33716344833374 seconds Time to load fused_adam op: 34.33844709396362 seconds [2025-05-19 02:14:00,844] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-05-19 02:14:00,845] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:14:00,845] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.32 GB, percent = 7.6% [2025-05-19 02:14:01,226] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-05-19 02:14:01,226] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:14:01,227] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.37 GB, percent = 7.6% [2025-05-19 02:14:01,227] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized [2025-05-19 02:14:01,600] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-05-19 02:14:01,601] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.78 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:14:01,601] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.67 GB, percent = 7.8% [2025-05-19 02:14:01,603] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-05-19 02:14:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-05-19 02:14:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-05-19 02:14:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-05-19 02:14:01,610] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] amp_params ................... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] comms_config ................. [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] dump_state ................... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_enabled ........... 
False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] graph_harvesting ............. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] optimizer_name ............... 
adamw [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] pld_params ................... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] train_batch_size ............. 128 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 16 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] world_size ................... 4 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1
[2025-05-19 02:14:01,613] [INFO] [config.py:989:print_user_config] json = {
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": true,
        "allgather_bucket_size": 1.000000e+09,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1.000000e+09,
        "contiguous_gradients": true
    },
    "fp16": {
        "enabled": false,
        "auto_cast": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 4e-05,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.01
        }
    },
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 128,
    "train_micro_batch_size_per_gpu": 16,
    "wall_clock_breakdown": true
}
[INFO|trainer.py:2361] 2025-05-19 02:14:01,614 >> ***** Running training *****
[INFO|trainer.py:2362] 2025-05-19 02:14:01,615 >> Num examples = 28,826
[INFO|trainer.py:2363] 2025-05-19 02:14:01,615 >> Num Epochs = 1
[INFO|trainer.py:2364] 2025-05-19 02:14:01,615 >> Instantaneous batch size per device = 16
[INFO|trainer.py:2367] 2025-05-19 02:14:01,615 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2368] 2025-05-19 02:14:01,615 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2369] 2025-05-19 02:14:01,615 >> Total optimization steps = 225
[INFO|trainer.py:2370] 2025-05-19 02:14:01,620 >> Number of trainable parameters = 8,798,208
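For reference, the numbers in the run summary follow directly from the DeepSpeed config above: 16 samples per GPU, 2 gradient-accumulation steps, and 4 ranks give the reported global batch of 128, and 28,826 examples at that batch size come out to the 225 optimization steps for one epoch. A quick sanity check in plain Python, with the values copied from the log:

```python
# Sanity-check of the effective batch size and step count reported above.
micro_batch_per_gpu = 16   # train_micro_batch_size_per_gpu
grad_accum_steps = 2       # gradient_accumulation_steps
world_size = 4             # number of ranks / GPUs
num_examples = 28_826      # "Num examples" from the trainer summary

effective_batch = micro_batch_per_gpu * grad_accum_steps * world_size
steps_per_epoch = num_examples // effective_batch

print(effective_batch)   # 128, matches "Total train batch size"
print(steps_per_epoch)   # 225, matches "Total optimization steps"
```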
[INFO|integration_utils.py:811] 2025-05-19 02:14:01,625 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: tienanh2003 (tienanh) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.6
wandb: Run data is saved locally in /kaggle/working/Vintern/internvl_chat/wandb/run-20250519_021401-stvnubd3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run Finetune_OCR
wandb: ⭐️ View project at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR
wandb: 🚀 View run at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR/runs/stvnubd3
  0%|          | 0/225 [00:00<?, ?it/s]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
[rank2]:     main()
[rank2]:   File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
[rank2]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop
[rank2]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3688, in training_step
[rank2]:     self.accelerator.backward(loss, **kwargs)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2238, in backward
[rank2]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 261, in backward
[rank2]:     self.engine.backward(loss, **kwargs)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank2]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
[rank2]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank2]:     scaled_loss.backward(retain_graph=retain_graph)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/torch/_tensor.py", line 581, in backward
[rank2]:     torch.autograd.backward(
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 347, in backward
[rank2]:     _engine_run_backward(
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.00 GiB. GPU 2 has a total capacity of 22.28 GiB of which 709.38 MiB is free. Process 16075 has 21.58 GiB memory in use. Of the allocated memory 18.60 GiB is allocated by PyTorch, and 2.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank3]: Traceback (most recent call last): [rank3]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in [rank3]: main() [rank3]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main [rank3]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train [rank3]: return inner_training_loop( [rank3]: ^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop [rank3]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3655, in training_step [rank3]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3709, in compute_loss [rank3]: outputs = model(**inputs) [rank3]: ^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank3]: return self._call_impl(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank3]: return forward_call(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn [rank3]: ret_val = func(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1899, in forward [rank3]: loss = self.module(*inputs, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank3]: return self._call_impl(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank3]: return forward_call(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/kaggle/working/Vintern/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py", line 202, in forward [rank3]: loss = loss_fct(shift_logits, shift_labels) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank3]: return self._call_impl(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank3]: return forward_call(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/loss.py", line 1293, in forward [rank3]: return F.cross_entropy( [rank3]: ^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 3479, in cross_entropy [rank3]: return torch._C._nn.cross_entropy_loss( [rank3]: 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.43 GiB. GPU 3 has a total capacity of 22.28 GiB of which 5.72 GiB is free. Process 16076 has 16.55 GiB memory in use. Of the allocated memory 15.83 GiB is allocated by PyTorch, and 387.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank1]: Traceback (most recent call last): [rank1]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in [rank1]: main() [rank1]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main [rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train [rank1]: return inner_training_loop( [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop [rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3655, in training_step [rank1]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3709, in compute_loss [rank1]: outputs = model(**inputs) [rank1]: ^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank1]: return self._call_impl(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank1]: return forward_call(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn [rank1]: ret_val = func(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1899, in forward [rank1]: loss = self.module(*inputs, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank1]: return self._call_impl(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank1]: return forward_call(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/kaggle/working/Vintern/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py", line 202, in forward [rank1]: loss = loss_fct(shift_logits, shift_labels) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank1]: return self._call_impl(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File 
"/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank1]: return forward_call(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/loss.py", line 1293, in forward [rank1]: return F.cross_entropy( [rank1]: ^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 3479, in cross_entropy [rank1]: return torch._C._nn.cross_entropy_loss( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.42 GiB. GPU 1 has a total capacity of 22.28 GiB of which 5.73 GiB is free. Process 16074 has 16.54 GiB memory in use. Of the allocated memory 15.83 GiB is allocated by PyTorch, and 386.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) dynamic ViT batch size: 16, images per sample: 1.0, dynamic token length: 1340 Traceback (most recent call last): File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in main() File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train return inner_training_loop( ^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop tr_loss_step = self.training_step(model, inputs, num_items_in_batch) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3688, in training_step self.accelerator.backward(loss, **kwargs) File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2238, in backward self.deepspeed_engine_wrapped.backward(loss, **kwargs) File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 261, in backward self.engine.backward(loss, **kwargs) File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn ret_val = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/usr/local/lib/python3.11/dist-packages/torch/_tensor.py", line 581, in backward torch.autograd.backward( File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 347, in backward _engine_run_backward( File "/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 6.05 GiB. GPU 0 has a total capacity of 22.28 GiB of which 537.38 MiB is free. Process 16073 has 21.75 GiB memory in use. Of the allocated memory 18.71 GiB is allocated by PyTorch, and 2.70 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank0]: Traceback (most recent call last): [rank0]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in [rank0]: main() [rank0]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main [rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train [rank0]: return inner_training_loop( [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop [rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3688, in training_step [rank0]: self.accelerator.backward(loss, **kwargs) [rank0]: File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2238, in backward [rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs) [rank0]: File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 261, in backward [rank0]: self.engine.backward(loss, **kwargs) [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn [rank0]: ret_val = func(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward [rank0]: self.optimizer.backward(loss, retain_graph=retain_graph) [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward [rank0]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward [rank0]: scaled_loss.backward(retain_graph=retain_graph) [rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_tensor.py", line 581, in backward [rank0]: torch.autograd.backward( [rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 347, in backward [rank0]: _engine_run_backward( [rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward [rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.05 GiB. GPU 0 has a total capacity of 22.28 GiB of which 537.38 MiB is free. Process 16073 has 21.75 GiB memory in use. Of the allocated memory 18.71 GiB is allocated by PyTorch, and 2.70 GiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
wandb:
wandb: 🚀 View run Finetune_OCR at: https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR/runs/stvnubd3
wandb: Find logs at: wandb/run-20250519_021401-stvnubd3/logs
W0519 02:14:49.552000 743 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 746 closing signal SIGTERM
E0519 02:14:49.967000 743 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 747) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-05-19_02:14:49
  host      : 27c18ac09229
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 748)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-05-19_02:14:49
  host      : 27c18ac09229
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 749)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-19_02:14:49
  host      : 27c18ac09229
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 747)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
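All four ranks hit the same out-of-memory condition around the loss/backward step, and the allocator messages above already point at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That variable is only read when the CUDA caching allocator initializes, so it has to be in the environment before any CUDA work; in this setup the simplest route is exporting it in the shell that invokes torchrun, since the launched ranks inherit the environment. Lowering train_micro_batch_size_per_gpu (with a matching increase in gradient_accumulation_steps, e.g. 8 x 4 x 4 ranks, to keep the global batch at 128) is the other obvious lever. A minimal sketch of the in-process variant, assuming it runs at the very top of the training entry point before torch touches the GPU:

```python
# Sketch: act on the allocator hint from the OOM messages above.
# PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA caching allocator
# starts up, i.e. before the first CUDA tensor is created in this process.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after setting the variable so the allocator picks it up

print(torch.cuda.is_available())
```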
W0519 02:16:50.115000 2646 torch/distributed/run.py:793]
W0519 02:16:50.115000 2646 torch/distributed/run.py:793] *****************************************
W0519 02:16:50.115000 2646 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0519 02:16:50.115000 2646 torch/distributed/run.py:793] *****************************************
[2025-05-19 02:16:52,314] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:16:52,314] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:16:52,345] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:16:52,366] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747621015.959415 2650 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747621015.959416 2651 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747621015.959417 2652 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747621015.959605 2649 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747621015.966111 2651 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747621015.966116 2652 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747621015.966126 2650 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747621015.966315 2649 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0519 02:17:10.094000 2788 torch/distributed/run.py:793]
W0519 02:17:10.094000 2788 torch/distributed/run.py:793] *****************************************
W0519 02:17:10.094000 2788 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0519 02:17:10.094000 2788 torch/distributed/run.py:793] ***************************************** [2025-05-19 02:17:12,294] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:12,306] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:12,343] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:12,344] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621035.902815 2793 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621035.909401 2793 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621035.957164 2792 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621035.957450 2791 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621035.957448 2794 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621035.963818 2792 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621035.964008 2794 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621035.964225 2791 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. 
Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2025-05-19 02:17:20,100] [INFO] [comm.py:652:init_distributed] cdb=None [2025-05-19 02:17:20,100] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file vocab.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file merges.txt [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file chat_template.jinja [2025-05-19 02:17:20,210] [INFO] [comm.py:652:init_distributed] cdb=None [2025-05-19 02:17:20,238] [INFO] [comm.py:652:init_distributed] cdb=None [2025-05-19 02:17:20,243] [INFO] [comm.py:652:init_distributed] cdb=None [INFO|tokenization_utils_base.py:2300] 2025-05-19 02:17:20,472 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[INFO|configuration_utils.py:693] 2025-05-19 02:17:20,479 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/config.json [INFO|configuration_utils.py:762] 2025-05-19 02:17:20,481 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "5CD-AI/Vintern-1B-v3_5--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 4, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "Hermes-2", "torch_dtype": "float32", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, 
"forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } [INFO|modeling_utils.py:3950] 2025-05-19 02:17:20,481 >> loading weights file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/model.safetensors [INFO|modeling_utils.py:1641] 2025-05-19 02:17:20,507 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-05-19 02:17:20,508 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:1140] 2025-05-19 02:17:20,570 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "use_cache": false } [INFO|modeling_utils.py:4849] 2025-05-19 02:17:22,417 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4857] 2025-05-19 02:17:22,417 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:1093] 2025-05-19 02:17:22,422 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/generation_config.json [INFO|configuration_utils.py:1140] 2025-05-19 02:17:22,422 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } W0519 02:17:36.849000 2964 torch/distributed/run.py:793] W0519 02:17:36.849000 2964 torch/distributed/run.py:793] ***************************************** W0519 02:17:36.849000 2964 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0519 02:17:36.849000 2964 torch/distributed/run.py:793] ***************************************** [2025-05-19 02:17:39,047] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:39,054] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:39,056] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:39,057] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR WARNING: All log messages before absl::InitializeLog() is called are written to STDERR WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621062.691754 2970 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621062.691754 2967 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621062.691747 2969 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621062.691760 2968 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621062.698420 2970 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621062.698441 2969 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621062.698452 2967 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621062.698472 2968 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. 
If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
Replace train sampler!!
Replace train sampler!!
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
[2025-05-19 02:17:46,807] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:17:46,807] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file chat_template.jinja
[2025-05-19 02:17:46,970] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:17:46,970] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:17:46,971] [INFO] [comm.py:652:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2300] 2025-05-19 02:17:47,185 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:693] 2025-05-19 02:17:47,193 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/config.json [INFO|configuration_utils.py:762] 2025-05-19 02:17:47,195 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "5CD-AI/Vintern-1B-v3_5--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 4, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "Hermes-2", "torch_dtype": "float32", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, 
"forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } [INFO|modeling_utils.py:3950] 2025-05-19 02:17:47,195 >> loading weights file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/model.safetensors [INFO|modeling_utils.py:1641] 2025-05-19 02:17:47,220 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-05-19 02:17:47,222 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:1140] 2025-05-19 02:17:47,281 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "use_cache": false } [INFO|modeling_utils.py:4849] 2025-05-19 02:17:49,107 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4857] 2025-05-19 02:17:49,107 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:1093] 2025-05-19 02:17:49,111 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/generation_config.json [INFO|configuration_utils.py:1140] 2025-05-19 02:17:49,112 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,152 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,228 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,241 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). 
Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,307 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 [WARNING|trainer.py:796] 2025-05-19 02:18:05,034 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,034 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,043 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,043 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,204 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,204 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [INFO|trainer.py:734] 2025-05-19 02:18:13,851 >> Using auto half precision backend [WARNING|trainer.py:796] 2025-05-19 02:18:14,211 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:14,211 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [2025-05-19 02:18:14,235] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown [2025-05-19 02:18:14,236] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4 [2025-05-19 02:18:15,237] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Detected CUDA files, patching ldflags Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.0371098518371582 seconds Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... 
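The `1659 > 1500` warnings mean a tokenized sample exceeds the tokenizer's model_max_length; that is typically harmless as long as the data pipeline truncates or packs to its own maximum afterwards. The `trainable params ... trainable%: 1.3780` lines are the PEFT-style summary of the LoRA adapters: only about 8.8 M of 638 M parameters are updated. Below is a sketch of a LoRA setup in that ballpark; the rank, alpha and target modules are assumptions, not values read from the log:

```python
# Sketch of a PEFT LoRA setup producing a ~1-2% trainable fraction.
# `model` is the InternVLChatModel loaded as in the earlier sketch;
# model.language_model is the Qwen2 LLM inside it.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                                      # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model.language_model = get_peft_model(model.language_model, lora_cfg)
model.language_model.print_trainable_parameters()
# prints "trainable params: ... || all params: ... || trainable%: ..." like the lines above
```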
Time to load fused_adam op: 0.10166311264038086 secondsTime to load fused_adam op: 0.10155916213989258 seconds Time to load fused_adam op: 0.10158443450927734 seconds [2025-05-19 02:18:15,491] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-05-19 02:18:15,492] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-05-19 02:18:15,523] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-05-19 02:18:15,523] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-05-19 02:18:15,523] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [2025-05-19 02:18:15,956] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-05-19 02:18:15,956] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:18:15,957] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.32 GB, percent = 7.6% [2025-05-19 02:18:16,334] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-05-19 02:18:16,335] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:18:16,335] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.47 GB, percent = 7.7% [2025-05-19 02:18:16,335] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized [2025-05-19 02:18:16,703] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-05-19 02:18:16,704] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.78 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:18:16,704] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.66 GB, percent = 7.8% [2025-05-19 02:18:16,706] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-05-19 02:18:16,706] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-05-19 02:18:16,707] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-05-19 02:18:16,707] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-05-19 02:18:16,713] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-05-19 02:18:16,713] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] amp_params ................... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] comms_config ................. [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] dump_state ................... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] eigenvalue_enabled ........... 
False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] graph_harvesting ............. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] optimizer_name ............... 
adamw [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] pld_params ................... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] train_batch_size ............. 64 [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 8 [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] world_size ................... 4 [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-05-19 02:18:16,716] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "wall_clock_breakdown": true } [INFO|trainer.py:2361] 2025-05-19 02:18:16,717 >> ***** Running training ***** [INFO|trainer.py:2362] 2025-05-19 02:18:16,717 >> Num examples = 28,826 [INFO|trainer.py:2363] 2025-05-19 02:18:16,718 >> Num Epochs = 1 [INFO|trainer.py:2364] 2025-05-19 02:18:16,718 >> Instantaneous batch size per device = 8 [INFO|trainer.py:2367] 2025-05-19 02:18:16,718 >> Total train batch size (w. parallel, distributed & accumulation) = 64 [INFO|trainer.py:2368] 2025-05-19 02:18:16,718 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2369] 2025-05-19 02:18:16,718 >> Total optimization steps = 450 [INFO|trainer.py:2370] 2025-05-19 02:18:16,722 >> Number of trainable parameters = 8,798,208 [INFO|integration_utils.py:811] 2025-05-19 02:18:16,727 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: tienanh2003 (tienanh) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Tracking run with wandb version 0.19.6 wandb: Run data is saved locally in /kaggle/working/Vintern/internvl_chat/wandb/run-20250519_021816-5budu5nq wandb: Run `wandb offline` to turn off syncing. 
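The numbers in the training banner are mutually consistent. One plausible accounting (the exact rounding depends on sampler padding and drop_last behaviour) that reproduces the global batch of 64 and the 450 optimization steps:

```python
# Back-of-the-envelope check of the "Running training" banner above,
# assuming the distributed sampler pads 28,826 examples evenly across 4 ranks.
import math

num_examples = 28_826
world_size = 4
micro_batch = 8   # instantaneous batch size per device
grad_accum = 2    # gradient accumulation steps

global_batch = micro_batch * grad_accum * world_size   # 64, matches train_batch_size
per_rank = math.ceil(num_examples / world_size)        # 7,207 samples per rank
micro_steps = math.ceil(per_rank / micro_batch)        # 901 forward/backward passes
optim_steps = micro_steps // grad_accum                # 450 optimizer updates
print(global_batch, optim_steps)                       # 64 450
```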
wandb: Syncing run Finetune_OCR
wandb: ⭐️ View project at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR
wandb: 🚀 View run at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR/runs/5budu5nq
  0%|          | 0/450 [00:00<?, ?it/s]
>> Saving model checkpoint to /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450
[INFO|configuration_utils.py:419] 2025-05-19 02:57:00,042 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/config.json
[INFO|configuration_utils.py:909] 2025-05-19 02:57:00,042 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/generation_config.json
[INFO|modeling_utils.py:3042] 2025-05-19 02:57:02,632 >> Model weights saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/model.safetensors
[INFO|tokenization_utils_base.py:2485] 2025-05-19 02:57:02,667 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/tokenizer_config.json
[INFO|tokenization_utils_base.py:2494] 2025-05-19 02:57:02,667 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/special_tokens_map.json
[INFO|tokenization_utils_base.py:2545] 2025-05-19 02:57:02,668 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/added_tokens.json
[2025-05-19 02:57:02,883] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint global_step450 is about to be saved!
[2025-05-19 02:57:02,905] [INFO] [logging.py:128:log_dist] [Rank 0] Saving model checkpoint: /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/mp_rank_00_model_states.pt
[2025-05-19 02:57:02,905] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/mp_rank_00_model_states.pt...
[2025-05-19 02:57:04,984] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/mp_rank_00_model_states.pt.
[2025-05-19 02:57:04,987] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-05-19 02:57:05,018] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-05-19 02:57:05,019] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-05-19 02:57:05,020] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step450 is ready now!
[INFO|tokenization_utils_base.py:2485] 2025-05-19 02:57:06,106 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/tokenizer_config.json [INFO|tokenization_utils_base.py:2494] 2025-05-19 02:57:06,107 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/special_tokens_map.json [INFO|tokenization_utils_base.py:2545] 2025-05-19 02:57:06,107 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/added_tokens.json [INFO|trainer.py:2634] 2025-05-19 02:57:06,319 >> Training completed. Do not forget to share your model on huggingface.co/models =) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. {'train_runtime': 2329.597, 'train_samples_per_second': 12.374, 'train_steps_per_second': 0.193, 'train_loss': 1.1245052083333333, 'epoch': 1.0} 100%|██████████| 450/450 [38:48<00:00, 5.04s/it] 100%|██████████| 450/450 [38:48<00:00, 5.17s/it] [INFO|trainer.py:4669] 2025-05-19 02:57:06,326 >> Waiting for the current checkpoint push to be finished, this might take a couple of minutes. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
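A quick cross-check of the summary dictionary above against the run banner (28,826 examples, 450 optimization steps):

```python
# Verifying the reported throughput figures.
runtime_s = 2329.597
num_examples = 28_826
optim_steps = 450

print(num_examples / runtime_s)  # ~12.37  -> train_samples_per_second
print(optim_steps / runtime_s)   # ~0.193  -> train_steps_per_second
print(runtime_s / optim_steps)   # ~5.18 s per step, matching the ~5.17 s/it progress bar
```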
[INFO|trainer.py:3888] 2025-05-19 02:57:51,976 >> Saving model checkpoint to /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR [INFO|configuration_utils.py:419] 2025-05-19 02:57:51,980 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/config.json [INFO|configuration_utils.py:909] 2025-05-19 02:57:51,980 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/generation_config.json [INFO|modeling_utils.py:3042] 2025-05-19 02:57:55,707 >> Model weights saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/model.safetensors [INFO|tokenization_utils_base.py:2485] 2025-05-19 02:57:55,710 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/tokenizer_config.json [INFO|tokenization_utils_base.py:2494] 2025-05-19 02:57:55,711 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/special_tokens_map.json [INFO|tokenization_utils_base.py:2545] 2025-05-19 02:57:55,711 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/added_tokens.json [INFO|trainer.py:3888] 2025-05-19 02:57:56,380 >> Saving model checkpoint to /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR [INFO|configuration_utils.py:419] 2025-05-19 02:57:56,384 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/config.json [INFO|configuration_utils.py:909] 2025-05-19 02:57:56,384 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/generation_config.json [INFO|modeling_utils.py:3042] 2025-05-19 02:58:00,185 >> Model weights saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/model.safetensors [INFO|tokenization_utils_base.py:2485] 2025-05-19 02:58:00,188 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/tokenizer_config.json [INFO|tokenization_utils_base.py:2494] 2025-05-19 02:58:00,189 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/special_tokens_map.json [INFO|tokenization_utils_base.py:2545] 2025-05-19 02:58:00,189 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/added_tokens.json
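The run finishes by writing the model weights, tokenizer files and generation config to the Finetune_lora_OCR directory twice in quick succession (the final save and the checkpoint push). A sketch of reloading that output for OCR inference; whether the LoRA weights are already merged into model.safetensors depends on how the training script saves, so treat that as an assumption:

```python
# Hedged sketch: reload the finished run, assuming a directly loadable checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "/kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR"

tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True, use_fast=False)
model = (
    AutoModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16, trust_remote_code=True)
    .eval()
    .cuda()
)
# InternVL-style checkpoints are normally queried via
# model.chat(tokenizer, pixel_values, question, generation_config),
# where pixel_values come from the 448x448 dynamic-patch preprocessing in the config dump.
```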