W0519 02:12:52.263000 743 torch/distributed/run.py:793]
W0519 02:12:52.263000 743 torch/distributed/run.py:793] *****************************************
W0519 02:12:52.263000 743 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0519 02:12:52.263000 743 torch/distributed/run.py:793] *****************************************
[2025-05-19 02:12:54,462] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:12:54,464] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:12:54,464] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:12:54,465] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747620778.093178 748 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.093175 747 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.093179 749 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.093178 746 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747620778.099817 747 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747620778.099827 746 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747620778.099831 749 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747620778.099834 748 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
/usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
Replace train sampler!!
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
[2025-05-19 02:13:05,693] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:13:05,693] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:13:05,832 >> loading file chat_template.jinja
[2025-05-19 02:13:05,903] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:13:05,906] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:13:05,907] [INFO] [comm.py:652:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2300] 2025-05-19 02:13:06,105 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:693] 2025-05-19 02:13:06,113 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/config.json [INFO|configuration_utils.py:762] 2025-05-19 02:13:06,114 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "5CD-AI/Vintern-1B-v3_5--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 4, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "Hermes-2", "torch_dtype": "float32", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, 
"forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } [INFO|modeling_utils.py:3950] 2025-05-19 02:13:06,115 >> loading weights file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/model.safetensors [INFO|modeling_utils.py:1641] 2025-05-19 02:13:06,165 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-05-19 02:13:06,167 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:1140] 2025-05-19 02:13:06,264 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "use_cache": false } [INFO|modeling_utils.py:4849] 2025-05-19 02:13:08,111 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4857] 2025-05-19 02:13:08,111 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:1093] 2025-05-19 02:13:08,116 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/generation_config.json [INFO|configuration_utils.py:1140] 2025-05-19 02:13:08,116 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,247 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,605 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,662 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). 
Running this sequence through the model will result in indexing errors
[WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:13:14,695 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780
[INFO|trainer.py:734] 2025-05-19 02:13:24,433 >> Using auto half precision backend
[WARNING|trainer.py:796] 2025-05-19 02:13:24,656 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,656 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,660 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,660 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,680 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,680 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,821 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:796] 2025-05-19 02:13:24,821 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[2025-05-19 02:13:24,847] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2025-05-19 02:13:24,847] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2025-05-19 02:13:26,021] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py311_cu124/fused_adam...
Creating extension directory /root/.cache/torch_extensions/py311_cu124/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
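The fused_adam extension is JIT-compiled by DeepSpeed on first use, which is what the ninja build below and the roughly 34-second "Time to load fused_adam op" further down correspond to; the build is then cached under the torch extensions root shown above. A minimal sketch of how one might warm that cache before launching training (the builder class is from DeepSpeed's public op_builder API; using it here, or prebuilding at install time with DS_BUILD_FUSED_ADAM=1, is an assumption on my part, not something the original run does):

```python
# Sketch: pre-build DeepSpeed's fused_adam extension so a later torchrun
# launch can skip the one-off JIT compile (illustrative, not part of the
# original training script).
from deepspeed.ops.op_builder import FusedAdamBuilder

# Compiles the CUDA sources if no cached build exists and loads the module;
# the artifact is cached under ~/.cache/torch_extensions (or $TORCH_EXTENSIONS_DIR).
fused_adam_module = FusedAdamBuilder().load()
print(fused_adam_module.__name__)
```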
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.11/dist-packages/torch/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.11/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o [2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.11/dist-packages/torch/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.11/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.11/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /usr/local/lib/python3.11/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o [3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.11/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so Loading extension module fused_adam... Time to load fused_adam op: 34.320404291152954 seconds [2025-05-19 02:14:00,350] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-05-19 02:14:00,350] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer Loading extension module fused_adam... 
Time to load fused_adam op: 34.33998680114746 seconds [2025-05-19 02:14:00,382] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-05-19 02:14:00,382] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-05-19 02:14:00,382] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-05-19 02:14:00,382] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-05-19 02:14:00,382] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-05-19 02:14:00,383] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-05-19 02:14:00,383] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 34.33716344833374 seconds Time to load fused_adam op: 34.33844709396362 seconds [2025-05-19 02:14:00,844] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-05-19 02:14:00,845] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:14:00,845] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.32 GB, percent = 7.6% [2025-05-19 02:14:01,226] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-05-19 02:14:01,226] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:14:01,227] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.37 GB, percent = 7.6% [2025-05-19 02:14:01,227] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized [2025-05-19 02:14:01,600] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-05-19 02:14:01,601] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.78 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:14:01,601] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.67 GB, percent = 7.8% [2025-05-19 02:14:01,603] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-05-19 02:14:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-05-19 02:14:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-05-19 02:14:01,604] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-05-19 02:14:01,610] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] amp_params ................... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] comms_config ................. [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-05-19 02:14:01,611] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] dump_state ................... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_enabled ........... 
False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] graph_harvesting ............. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] optimizer_name ............... 
adamw [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] pld_params ................... False [2025-05-19 02:14:01,612] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] train_batch_size ............. 128 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 16 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] world_size ................... 4 [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-05-19 02:14:01,613] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1
[2025-05-19 02:14:01,613] [INFO] [config.py:989:print_user_config] json = {
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": true,
        "allgather_bucket_size": 1.000000e+09,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1.000000e+09,
        "contiguous_gradients": true
    },
    "fp16": {
        "enabled": false,
        "auto_cast": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 4e-05,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.01
        }
    },
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 128,
    "train_micro_batch_size_per_gpu": 16,
    "wall_clock_breakdown": true
}
[INFO|trainer.py:2361] 2025-05-19 02:14:01,614 >> ***** Running training *****
[INFO|trainer.py:2362] 2025-05-19 02:14:01,615 >> Num examples = 28,826
[INFO|trainer.py:2363] 2025-05-19 02:14:01,615 >> Num Epochs = 1
[INFO|trainer.py:2364] 2025-05-19 02:14:01,615 >> Instantaneous batch size per device = 16
[INFO|trainer.py:2367] 2025-05-19 02:14:01,615 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2368] 2025-05-19 02:14:01,615 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2369] 2025-05-19 02:14:01,615 >> Total optimization steps = 225
[INFO|trainer.py:2370] 2025-05-19 02:14:01,620 >> Number of trainable parameters = 8,798,208
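For reference, the numbers in the run summary follow directly from the DeepSpeed config above: 16 samples per GPU, 2 gradient-accumulation steps, and 4 ranks give the reported global batch of 128, and 28,826 examples at that batch size come out to the 225 optimization steps for one epoch. A quick sanity check in plain Python, with the values copied from the log:

```python
# Sanity-check of the effective batch size and step count reported above.
micro_batch_per_gpu = 16   # train_micro_batch_size_per_gpu
grad_accum_steps = 2       # gradient_accumulation_steps
world_size = 4             # number of ranks / GPUs
num_examples = 28_826      # "Num examples" from the trainer summary

effective_batch = micro_batch_per_gpu * grad_accum_steps * world_size
steps_per_epoch = num_examples // effective_batch

print(effective_batch)   # 128, matches "Total train batch size"
print(steps_per_epoch)   # 225, matches "Total optimization steps"
```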
[INFO|integration_utils.py:811] 2025-05-19 02:14:01,625 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: tienanh2003 (tienanh) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.6
wandb: Run data is saved locally in /kaggle/working/Vintern/internvl_chat/wandb/run-20250519_021401-stvnubd3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run Finetune_OCR
wandb: ⭐️ View project at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR
wandb: 🚀 View run at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR/runs/stvnubd3
  0%|          | 0/225 [00:00<?, ?it/s]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in <module>
[rank2]:     main()
[rank2]:   File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
[rank2]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop
[rank2]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3688, in training_step
[rank2]:     self.accelerator.backward(loss, **kwargs)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2238, in backward
[rank2]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 261, in backward
[rank2]:     self.engine.backward(loss, **kwargs)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank2]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
[rank2]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank2]:     scaled_loss.backward(retain_graph=retain_graph)
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/torch/_tensor.py", line 581, in backward
[rank2]:     torch.autograd.backward(
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 347, in backward
[rank2]:     _engine_run_backward(
[rank2]:   File "/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.00 GiB. GPU 2 has a total capacity of 22.28 GiB of which 709.38 MiB is free. Process 16075 has 21.58 GiB memory in use. Of the allocated memory 18.60 GiB is allocated by PyTorch, and 2.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank3]: Traceback (most recent call last): [rank3]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in [rank3]: main() [rank3]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main [rank3]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train [rank3]: return inner_training_loop( [rank3]: ^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop [rank3]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3655, in training_step [rank3]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3709, in compute_loss [rank3]: outputs = model(**inputs) [rank3]: ^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank3]: return self._call_impl(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank3]: return forward_call(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn [rank3]: ret_val = func(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1899, in forward [rank3]: loss = self.module(*inputs, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank3]: return self._call_impl(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank3]: return forward_call(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/kaggle/working/Vintern/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py", line 202, in forward [rank3]: loss = loss_fct(shift_logits, shift_labels) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank3]: return self._call_impl(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank3]: return forward_call(*args, **kwargs) [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/loss.py", line 1293, in forward [rank3]: return F.cross_entropy( [rank3]: ^^^^^^^^^^^^^^^^ [rank3]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 3479, in cross_entropy [rank3]: return torch._C._nn.cross_entropy_loss( [rank3]: 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.43 GiB. GPU 3 has a total capacity of 22.28 GiB of which 5.72 GiB is free. Process 16076 has 16.55 GiB memory in use. Of the allocated memory 15.83 GiB is allocated by PyTorch, and 387.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank1]: Traceback (most recent call last): [rank1]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in [rank1]: main() [rank1]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main [rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train [rank1]: return inner_training_loop( [rank1]: ^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop [rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3655, in training_step [rank1]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3709, in compute_loss [rank1]: outputs = model(**inputs) [rank1]: ^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank1]: return self._call_impl(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank1]: return forward_call(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn [rank1]: ret_val = func(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1899, in forward [rank1]: loss = self.module(*inputs, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank1]: return self._call_impl(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank1]: return forward_call(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/kaggle/working/Vintern/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py", line 202, in forward [rank1]: loss = loss_fct(shift_logits, shift_labels) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl [rank1]: return self._call_impl(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File 
"/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl [rank1]: return forward_call(*args, **kwargs) [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/loss.py", line 1293, in forward [rank1]: return F.cross_entropy( [rank1]: ^^^^^^^^^^^^^^^^ [rank1]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py", line 3479, in cross_entropy [rank1]: return torch._C._nn.cross_entropy_loss( [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.42 GiB. GPU 1 has a total capacity of 22.28 GiB of which 5.73 GiB is free. Process 16074 has 16.54 GiB memory in use. Of the allocated memory 15.83 GiB is allocated by PyTorch, and 386.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) dynamic ViT batch size: 16, images per sample: 1.0, dynamic token length: 1340 Traceback (most recent call last): File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in main() File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train return inner_training_loop( ^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop tr_loss_step = self.training_step(model, inputs, num_items_in_batch) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3688, in training_step self.accelerator.backward(loss, **kwargs) File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2238, in backward self.deepspeed_engine_wrapped.backward(loss, **kwargs) File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 261, in backward self.engine.backward(loss, **kwargs) File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn ret_val = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/usr/local/lib/python3.11/dist-packages/torch/_tensor.py", line 581, in backward torch.autograd.backward( File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 347, in backward _engine_run_backward( File "/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 6.05 GiB. GPU 0 has a total capacity of 22.28 GiB of which 537.38 MiB is free. Process 16073 has 21.75 GiB memory in use. Of the allocated memory 18.71 GiB is allocated by PyTorch, and 2.70 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank0]: Traceback (most recent call last): [rank0]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in [rank0]: main() [rank0]: File "/kaggle/working/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main [rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2155, in train [rank0]: return inner_training_loop( [rank0]: ^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2522, in _inner_training_loop [rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3688, in training_step [rank0]: self.accelerator.backward(loss, **kwargs) [rank0]: File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2238, in backward [rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs) [rank0]: File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 261, in backward [rank0]: self.engine.backward(loss, **kwargs) [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn [rank0]: ret_val = func(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2020, in backward [rank0]: self.optimizer.backward(loss, retain_graph=retain_graph) [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward [rank0]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) [rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward [rank0]: scaled_loss.backward(retain_graph=retain_graph) [rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_tensor.py", line 581, in backward [rank0]: torch.autograd.backward( [rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 347, in backward [rank0]: _engine_run_backward( [rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py", line 825, in _engine_run_backward [rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.05 GiB. GPU 0 has a total capacity of 22.28 GiB of which 537.38 MiB is free. Process 16073 has 21.75 GiB memory in use. Of the allocated memory 18.71 GiB is allocated by PyTorch, and 2.70 GiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
wandb:
wandb: 🚀 View run Finetune_OCR at: https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR/runs/stvnubd3
wandb: Find logs at: wandb/run-20250519_021401-stvnubd3/logs
W0519 02:14:49.552000 743 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 746 closing signal SIGTERM
E0519 02:14:49.967000 743 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 747) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-05-19_02:14:49
  host      : 27c18ac09229
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 748)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-05-19_02:14:49
  host      : 27c18ac09229
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 749)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-19_02:14:49
  host      : 27c18ac09229
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 747)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
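All four ranks hit the same out-of-memory condition around the loss/backward step, and the allocator messages above already point at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That variable is only read when the CUDA caching allocator initializes, so it has to be in the environment before any CUDA work; in this setup the simplest route is exporting it in the shell that invokes torchrun, since the launched ranks inherit the environment. Lowering train_micro_batch_size_per_gpu (with a matching increase in gradient_accumulation_steps, e.g. 8 x 4 x 4 ranks, to keep the global batch at 128) is the other obvious lever. A minimal sketch of the in-process variant, assuming it runs at the very top of the training entry point before torch touches the GPU:

```python
# Sketch: act on the allocator hint from the OOM messages above.
# PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA caching allocator
# starts up, i.e. before the first CUDA tensor is created in this process.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after setting the variable so the allocator picks it up

print(torch.cuda.is_available())
```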
W0519 02:16:50.115000 2646 torch/distributed/run.py:793]
W0519 02:16:50.115000 2646 torch/distributed/run.py:793] *****************************************
W0519 02:16:50.115000 2646 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0519 02:16:50.115000 2646 torch/distributed/run.py:793] *****************************************
[2025-05-19 02:16:52,314] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:16:52,314] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:16:52,345] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-19 02:16:52,366] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747621015.959415 2650 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747621015.959416 2651 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747621015.959417 2652 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1747621015.959605 2649 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747621015.966111 2651 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747621015.966116 2652 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747621015.966126 2650 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
E0000 00:00:1747621015.966315 2649 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0519 02:17:10.094000 2788 torch/distributed/run.py:793]
W0519 02:17:10.094000 2788 torch/distributed/run.py:793] *****************************************
W0519 02:17:10.094000 2788 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0519 02:17:10.094000 2788 torch/distributed/run.py:793] ***************************************** [2025-05-19 02:17:12,294] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:12,306] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:12,343] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:12,344] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621035.902815 2793 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621035.909401 2793 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621035.957164 2792 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621035.957450 2791 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621035.957448 2794 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621035.963818 2792 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621035.964008 2794 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621035.964225 2791 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. 
Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2025-05-19 02:17:20,100] [INFO] [comm.py:652:init_distributed] cdb=None [2025-05-19 02:17:20,100] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file vocab.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file merges.txt [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:20,198 >> loading file chat_template.jinja [2025-05-19 02:17:20,210] [INFO] [comm.py:652:init_distributed] cdb=None [2025-05-19 02:17:20,238] [INFO] [comm.py:652:init_distributed] cdb=None [2025-05-19 02:17:20,243] [INFO] [comm.py:652:init_distributed] cdb=None [INFO|tokenization_utils_base.py:2300] 2025-05-19 02:17:20,472 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[INFO|configuration_utils.py:693] 2025-05-19 02:17:20,479 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/config.json [INFO|configuration_utils.py:762] 2025-05-19 02:17:20,481 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "5CD-AI/Vintern-1B-v3_5--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 4, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "Hermes-2", "torch_dtype": "float32", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, 
"forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } [INFO|modeling_utils.py:3950] 2025-05-19 02:17:20,481 >> loading weights file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/model.safetensors [INFO|modeling_utils.py:1641] 2025-05-19 02:17:20,507 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-05-19 02:17:20,508 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:1140] 2025-05-19 02:17:20,570 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "use_cache": false } [INFO|modeling_utils.py:4849] 2025-05-19 02:17:22,417 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4857] 2025-05-19 02:17:22,417 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:1093] 2025-05-19 02:17:22,422 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/generation_config.json [INFO|configuration_utils.py:1140] 2025-05-19 02:17:22,422 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } W0519 02:17:36.849000 2964 torch/distributed/run.py:793] W0519 02:17:36.849000 2964 torch/distributed/run.py:793] ***************************************** W0519 02:17:36.849000 2964 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0519 02:17:36.849000 2964 torch/distributed/run.py:793] ***************************************** [2025-05-19 02:17:39,047] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:39,054] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:39,056] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-05-19 02:17:39,057] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR WARNING: All log messages before absl::InitializeLog() is called are written to STDERR WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621062.691754 2970 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621062.691754 2967 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621062.691747 2969 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1747621062.691760 2968 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1747621062.698420 2970 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621062.698441 2969 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621062.698452 2967 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered E0000 00:00:1747621062.698472 2968 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /usr/local/lib/python3.11/dist-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. 
If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
Replace train sampler!!
Replace train sampler!!
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
[2025-05-19 02:17:46,807] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:17:46,807] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2028] 2025-05-19 02:17:46,913 >> loading file chat_template.jinja
[2025-05-19 02:17:46,970] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:17:46,970] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-05-19 02:17:46,971] [INFO] [comm.py:652:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2300] 2025-05-19 02:17:47,185 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:693] 2025-05-19 02:17:47,193 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/config.json [INFO|configuration_utils.py:762] 2025-05-19 02:17:47,195 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "5CD-AI/Vintern-1B-v3_5--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v3_5--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 4, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "Hermes-2", "torch_dtype": "float32", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, 
"forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.47.0", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } [INFO|modeling_utils.py:3950] 2025-05-19 02:17:47,195 >> loading weights file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/model.safetensors [INFO|modeling_utils.py:1641] 2025-05-19 02:17:47,220 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-05-19 02:17:47,222 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:1140] 2025-05-19 02:17:47,281 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "use_cache": false } [INFO|modeling_utils.py:4849] 2025-05-19 02:17:49,107 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4857] 2025-05-19 02:17:49,107 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:1093] 2025-05-19 02:17:49,111 >> loading configuration file /kaggle/working/Vintern/pretrained/Vintern-1B-v3_5/generation_config.json [INFO|configuration_utils.py:1140] 2025-05-19 02:17:49,112 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,152 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,228 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,241 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). 
Running this sequence through the model will result in indexing errors [WARNING|tokenization_utils_base.py:3928] 2025-05-19 02:17:55,307 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1659 > 1500). Running this sequence through the model will result in indexing errors trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3780 [WARNING|trainer.py:796] 2025-05-19 02:18:05,034 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,034 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,043 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,043 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,204 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:05,204 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [INFO|trainer.py:734] 2025-05-19 02:18:13,851 >> Using auto half precision backend [WARNING|trainer.py:796] 2025-05-19 02:18:14,211 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:796] 2025-05-19 02:18:14,211 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [2025-05-19 02:18:14,235] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown [2025-05-19 02:18:14,236] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4 [2025-05-19 02:18:15,237] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root... Detected CUDA files, patching ldflags Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.0371098518371582 seconds Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... 
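The `1659 > 1500` warnings mean a tokenized sample exceeds the tokenizer's model_max_length; that is typically harmless as long as the data pipeline truncates or packs to its own maximum afterwards. The `trainable params ... trainable%: 1.3780` lines are the PEFT-style summary of the LoRA adapters: only about 8.8 M of 638 M parameters are updated. Below is a sketch of a LoRA setup in that ballpark; the rank, alpha and target modules are assumptions, not values read from the log:

```python
# Sketch of a PEFT LoRA setup producing a ~1-2% trainable fraction.
# `model` is the InternVLChatModel loaded as in the earlier sketch;
# model.language_model is the Qwen2 LLM inside it.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                                      # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model.language_model = get_peft_model(model.language_model, lora_cfg)
model.language_model.print_trainable_parameters()
# prints "trainable params: ... || all params: ... || trainable%: ..." like the lines above
```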
Time to load fused_adam op: 0.10166311264038086 secondsTime to load fused_adam op: 0.10155916213989258 seconds Time to load fused_adam op: 0.10158443450927734 seconds [2025-05-19 02:18:15,491] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-05-19 02:18:15,492] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-05-19 02:18:15,523] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-05-19 02:18:15,523] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-05-19 02:18:15,523] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-05-19 02:18:15,523] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [2025-05-19 02:18:15,956] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-05-19 02:18:15,956] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:18:15,957] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.32 GB, percent = 7.6% [2025-05-19 02:18:16,334] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-05-19 02:18:16,335] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.79 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:18:16,335] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.47 GB, percent = 7.7% [2025-05-19 02:18:16,335] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized [2025-05-19 02:18:16,703] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-05-19 02:18:16,704] [INFO] [utils.py:782:see_memory_usage] MA 1.78 GB Max_MA 1.78 GB CA 1.88 GB Max_CA 2 GB [2025-05-19 02:18:16,704] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 14.66 GB, percent = 7.8% [2025-05-19 02:18:16,706] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-05-19 02:18:16,706] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-05-19 02:18:16,707] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-05-19 02:18:16,707] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-05-19 02:18:16,713] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-05-19 02:18:16,713] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] amp_params ................... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] comms_config ................. [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] dump_state ................... False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] eigenvalue_enabled ........... 
False [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-05-19 02:18:16,714] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] graph_harvesting ............. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] optimizer_name ............... 
adamw [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] pld_params ................... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-05-19 02:18:16,715] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] train_batch_size ............. 64 [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 8 [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] world_size ................... 4 [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-05-19 02:18:16,716] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-05-19 02:18:16,716] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 64, "train_micro_batch_size_per_gpu": 8, "wall_clock_breakdown": true } [INFO|trainer.py:2361] 2025-05-19 02:18:16,717 >> ***** Running training ***** [INFO|trainer.py:2362] 2025-05-19 02:18:16,717 >> Num examples = 28,826 [INFO|trainer.py:2363] 2025-05-19 02:18:16,718 >> Num Epochs = 1 [INFO|trainer.py:2364] 2025-05-19 02:18:16,718 >> Instantaneous batch size per device = 8 [INFO|trainer.py:2367] 2025-05-19 02:18:16,718 >> Total train batch size (w. parallel, distributed & accumulation) = 64 [INFO|trainer.py:2368] 2025-05-19 02:18:16,718 >> Gradient Accumulation steps = 2 [INFO|trainer.py:2369] 2025-05-19 02:18:16,718 >> Total optimization steps = 450 [INFO|trainer.py:2370] 2025-05-19 02:18:16,722 >> Number of trainable parameters = 8,798,208 [INFO|integration_utils.py:811] 2025-05-19 02:18:16,727 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: tienanh2003 (tienanh) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Tracking run with wandb version 0.19.6 wandb: Run data is saved locally in /kaggle/working/Vintern/internvl_chat/wandb/run-20250519_021816-5budu5nq wandb: Run `wandb offline` to turn off syncing. 
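The numbers in the training banner are mutually consistent. One plausible accounting (the exact rounding depends on sampler padding and drop_last behaviour) that reproduces the global batch of 64 and the 450 optimization steps:

```python
# Back-of-the-envelope check of the "Running training" banner above,
# assuming the distributed sampler pads 28,826 examples evenly across 4 ranks.
import math

num_examples = 28_826
world_size = 4
micro_batch = 8   # instantaneous batch size per device
grad_accum = 2    # gradient accumulation steps

global_batch = micro_batch * grad_accum * world_size   # 64, matches train_batch_size
per_rank = math.ceil(num_examples / world_size)        # 7,207 samples per rank
micro_steps = math.ceil(per_rank / micro_batch)        # 901 forward/backward passes
optim_steps = micro_steps // grad_accum                # 450 optimizer updates
print(global_batch, optim_steps)                       # 64 450
```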
wandb: Syncing run Finetune_OCR
wandb: ⭐️ View project at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR
wandb: 🚀 View run at https://wandb.ai/tienanh/fine-tuned-vintern1B_v3_5_OCR/runs/5budu5nq
  0%|          | 0/450 [00:00<?, ?it/s]
>> Saving model checkpoint to /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450
[INFO|configuration_utils.py:419] 2025-05-19 02:57:00,042 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/config.json
[INFO|configuration_utils.py:909] 2025-05-19 02:57:00,042 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/generation_config.json
[INFO|modeling_utils.py:3042] 2025-05-19 02:57:02,632 >> Model weights saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/model.safetensors
[INFO|tokenization_utils_base.py:2485] 2025-05-19 02:57:02,667 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/tokenizer_config.json
[INFO|tokenization_utils_base.py:2494] 2025-05-19 02:57:02,667 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/special_tokens_map.json
[INFO|tokenization_utils_base.py:2545] 2025-05-19 02:57:02,668 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/added_tokens.json
[2025-05-19 02:57:02,883] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint global_step450 is about to be saved!
[2025-05-19 02:57:02,905] [INFO] [logging.py:128:log_dist] [Rank 0] Saving model checkpoint: /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/mp_rank_00_model_states.pt
[2025-05-19 02:57:02,905] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/mp_rank_00_model_states.pt...
[2025-05-19 02:57:04,984] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/mp_rank_00_model_states.pt.
[2025-05-19 02:57:04,987] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-05-19 02:57:05,018] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-05-19 02:57:05,019] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/checkpoint-450/global_step450/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-05-19 02:57:05,020] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step450 is ready now!
[INFO|tokenization_utils_base.py:2485] 2025-05-19 02:57:06,106 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/tokenizer_config.json [INFO|tokenization_utils_base.py:2494] 2025-05-19 02:57:06,107 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/special_tokens_map.json [INFO|tokenization_utils_base.py:2545] 2025-05-19 02:57:06,107 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/added_tokens.json [INFO|trainer.py:2634] 2025-05-19 02:57:06,319 >> Training completed. Do not forget to share your model on huggingface.co/models =) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. {'train_runtime': 2329.597, 'train_samples_per_second': 12.374, 'train_steps_per_second': 0.193, 'train_loss': 1.1245052083333333, 'epoch': 1.0} 100%|██████████| 450/450 [38:48<00:00, 5.04s/it] 100%|██████████| 450/450 [38:48<00:00, 5.17s/it] [INFO|trainer.py:4669] 2025-05-19 02:57:06,326 >> Waiting for the current checkpoint push to be finished, this might take a couple of minutes. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
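A quick cross-check of the summary dictionary above against the run banner (28,826 examples, 450 optimization steps):

```python
# Verifying the reported throughput figures.
runtime_s = 2329.597
num_examples = 28_826
optim_steps = 450

print(num_examples / runtime_s)  # ~12.37  -> train_samples_per_second
print(optim_steps / runtime_s)   # ~0.193  -> train_steps_per_second
print(runtime_s / optim_steps)   # ~5.18 s per step, matching the ~5.17 s/it progress bar
```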
[INFO|trainer.py:3888] 2025-05-19 02:57:51,976 >> Saving model checkpoint to /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR [INFO|configuration_utils.py:419] 2025-05-19 02:57:51,980 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/config.json [INFO|configuration_utils.py:909] 2025-05-19 02:57:51,980 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/generation_config.json [INFO|modeling_utils.py:3042] 2025-05-19 02:57:55,707 >> Model weights saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/model.safetensors [INFO|tokenization_utils_base.py:2485] 2025-05-19 02:57:55,710 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/tokenizer_config.json [INFO|tokenization_utils_base.py:2494] 2025-05-19 02:57:55,711 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/special_tokens_map.json [INFO|tokenization_utils_base.py:2545] 2025-05-19 02:57:55,711 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/added_tokens.json [INFO|trainer.py:3888] 2025-05-19 02:57:56,380 >> Saving model checkpoint to /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR [INFO|configuration_utils.py:419] 2025-05-19 02:57:56,384 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/config.json [INFO|configuration_utils.py:909] 2025-05-19 02:57:56,384 >> Configuration saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/generation_config.json [INFO|modeling_utils.py:3042] 2025-05-19 02:58:00,185 >> Model weights saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/model.safetensors [INFO|tokenization_utils_base.py:2485] 2025-05-19 02:58:00,188 >> tokenizer config file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/tokenizer_config.json [INFO|tokenization_utils_base.py:2494] 2025-05-19 02:58:00,189 >> Special tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/special_tokens_map.json [INFO|tokenization_utils_base.py:2545] 2025-05-19 02:58:00,189 >> added tokens file saved in /kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR/added_tokens.json
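The run finishes by writing the model weights, tokenizer files and generation config to the Finetune_lora_OCR directory twice in quick succession (the final save and the checkpoint push). A sketch of reloading that output for OCR inference; whether the LoRA weights are already merged into model.safetensors depends on how the training script saves, so treat that as an assumption:

```python
# Hedged sketch: reload the finished run, assuming a directly loadable checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "/kaggle/working/work_dirs/internvl_chat_v2_0/Finetune_lora_OCR"

tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True, use_fast=False)
model = (
    AutoModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16, trust_remote_code=True)
    .eval()
    .cuda()
)
# InternVL-style checkpoints are normally queried via
# model.chat(tokenizer, pixel_values, question, generation_config),
# where pixel_values come from the 448x448 dynamic-patch preprocessing in the config dump.
```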