|
|
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] |
|
|
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] ***************************************** |
|
|
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
|
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] ***************************************** |
|
|
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] |
|
|
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] ***************************************** |
|
|
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
|
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] ***************************************** |
|
|
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] |
|
|
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] ***************************************** |
|
|
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
|
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] ***************************************** |
|
|
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] |
|
|
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] ***************************************** |
|
|
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
|
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] ***************************************** |
|
|
2: [2025-09-02 18:42:25,579] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:3691281] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing` |
|
|
2: [2025-09-02 18:42:25,580] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:3691281] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing |
|
|
2: [2025-09-02 18:42:29,081] [INFO] [axolotl.utils.data.sft._load_raw_datasets:314] [PID:3691281] [RANK:0] Loading raw datasets... |
|
|
2: [2025-09-02 18:42:29,322] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:88] [PID:3691281] [RANK:0] Loading dataset: /lustre/fswork/projects/rech/qwv/udv55np/dataset/math/hf/no_thinking_text/generator/default-d32b2cae8ea7e541/0.0.0 with base_type: chat_template and prompt_style: None |
|
|
0: [2025-09-02 18:42:33,914] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:1478787] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing` |
|
|
0: [2025-09-02 18:42:33,914] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:1478787] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing |
|
|
1: [2025-09-02 18:42:33,920] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:669903] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing` |
|
|
1: [2025-09-02 18:42:33,920] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:669903] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing |
|
|
3: [2025-09-02 18:42:33,923] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:368126] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing` |
|
|
3: [2025-09-02 18:42:33,924] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:368126] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing |
|
|
0: [2025-09-02 18:42:37,930] [INFO] [axolotl.cli.config.load_cfg:245] [PID:1478787] [RANK:0] config: |
|
|
0: { |
|
|
0: "activation_offloading": false, |
|
|
0: "auto_resume_from_checkpoints": true, |
|
|
0: "axolotl_config_path": "/lustre/fswork/projects/rech/dgo/udv55np/train/tmp/1756826505740523622.yaml", |
|
|
0: "base_model": "/lustre/fswork/projects/rech/qwv/udv55np/Qwen/Qwen2.5-3B_ift", |
|
|
0: "base_model_config": "/lustre/fswork/projects/rech/qwv/udv55np/Qwen/Qwen2.5-3B_ift", |
|
|
0: "batch_size": 16, |
|
|
0: "bf16": true, |
|
|
0: "capabilities": { |
|
|
0: "bf16": true, |
|
|
0: "compute_capability": "sm_90", |
|
|
0: "fp8": false, |
|
|
0: "n_gpu": 16, |
|
|
0: "n_node": 1 |
|
|
0: }, |
|
|
0: "chat_template": "qwen_25", |
|
|
0: "context_parallel_size": 1, |
|
|
0: "dataloader_num_workers": 16, |
|
|
0: "dataloader_pin_memory": true, |
|
|
0: "dataloader_prefetch_factor": 256, |
|
|
0: "dataset_prepared_path": "/lustre/fsn1/projects/rech/dgo/udv55np/dataset_math/Qwen3-235B-A22B/0", |
|
|
0: "dataset_processes": 192, |
|
|
0: "datasets": [ |
|
|
0: { |
|
|
0: "chat_template": "tokenizer_default", |
|
|
0: "field_messages": "conversations", |
|
|
0: "message_property_mappings": { |
|
|
0: "content": "content", |
|
|
0: "role": "role" |
|
|
0: }, |
|
|
0: "path": "/lustre/fswork/projects/rech/qwv/udv55np/dataset/math/hf/no_thinking_text/generator/default-d32b2cae8ea7e541/0.0.0", |
|
|
0: "trust_remote_code": false, |
|
|
0: "type": "chat_template" |
|
|
0: } |
|
|
0: ], |
|
|
0: "ddp": true, |
|
|
0: "deepspeed": { |
|
|
0: "bf16": { |
|
|
0: "enabled": true |
|
|
0: }, |
|
|
0: "gradient_accumulation_steps": "auto", |
|
|
0: "gradient_clipping": "auto", |
|
|
0: "train_batch_size": "auto", |
|
|
0: "train_micro_batch_size_per_gpu": "auto", |
|
|
0: "wall_clock_breakdown": false, |
|
|
0: "zero_optimization": { |
|
|
0: "contiguous_gradients": true, |
|
|
0: "overlap_comm": true, |
|
|
0: "reduce_bucket_size": "auto", |
|
|
0: "stage": 3, |
|
|
0: "stage3_gather_16bit_weights_on_model_save": true, |
|
|
0: "stage3_param_persistence_threshold": "auto", |
|
|
0: "stage3_prefetch_bucket_size": "auto", |
|
|
0: "sub_group_size": 0 |
|
|
0: } |
|
|
0: }, |
|
|
0: "device": "cuda:0", |
|
|
0: "device_map": { |
|
|
0: "": 0 |
|
|
0: }, |
|
|
0: "dion_rank_fraction": 1.0, |
|
|
0: "dion_rank_multiple_of": 1, |
|
|
0: "env_capabilities": { |
|
|
0: "torch_version": "2.6.0" |
|
|
0: }, |
|
|
0: "eval_batch_size": 1, |
|
|
0: "eval_causal_lm_metrics": [ |
|
|
0: "sacrebleu", |
|
|
0: "comet", |
|
|
0: "ter", |
|
|
0: "chrf" |
|
|
0: ], |
|
|
0: "eval_max_new_tokens": 128, |
|
|
0: "eval_sample_packing": true, |
|
|
0: "eval_table_size": 0, |
|
|
0: "evals_per_epoch": 0, |
|
|
0: "flash_attention": true, |
|
|
0: "fp16": false, |
|
|
0: "gradient_accumulation_steps": 1, |
|
|
0: "gradient_checkpointing": true, |
|
|
0: "gradient_checkpointing_kwargs": { |
|
|
0: "use_reentrant": true |
|
|
0: }, |
|
|
0: "learning_rate": 5e-06, |
|
|
0: "lisa_layers_attribute": "model.layers", |
|
|
0: "load_best_model_at_end": false, |
|
|
0: "load_in_4bit": false, |
|
|
0: "load_in_8bit": false, |
|
|
0: "local_rank": 0, |
|
|
0: "logging_steps": 10, |
|
|
0: "lora_dropout": 0.0, |
|
|
0: "loraplus_lr_embedding": 1e-06, |
|
|
0: "lr_scheduler": "warmup_stable_decay", |
|
|
0: "lr_scheduler_kwargs": { |
|
|
0: "min_lr_ratio": 0.1, |
|
|
0: "num_decay_steps": 300 |
|
|
0: }, |
|
|
0: "max_prompt_len": 512, |
|
|
0: "mean_resizing_embeddings": false, |
|
|
0: "micro_batch_size": 1, |
|
|
0: "model_config_type": "qwen2", |
|
|
0: "num_epochs": 1.0, |
|
|
0: "optimizer": "adamw_torch_fused", |
|
|
0: "output_dir": "/lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0", |
|
|
0: "pad_to_sequence_len": true, |
|
|
0: "pretrain_multipack_attn": true, |
|
|
0: "pretrain_multipack_buffer_size": 10000, |
|
|
0: "profiler_steps_start": 0, |
|
|
0: "qlora_sharded_model_loading": false, |
|
|
0: "ray_num_workers": 1, |
|
|
0: "resources_per_worker": { |
|
|
0: "GPU": 1 |
|
|
0: }, |
|
|
0: "sample_packing": true, |
|
|
0: "sample_packing_bin_size": 200, |
|
|
0: "sample_packing_group_size": 100000, |
|
|
0: "save_only_model": false, |
|
|
0: "save_safetensors": true, |
|
|
0: "save_steps": 0.2, |
|
|
0: "save_total_limit": 20, |
|
|
0: "sequence_len": 16384, |
|
|
0: "shuffle_before_merging_datasets": false, |
|
|
0: "shuffle_merged_datasets": true, |
|
|
0: "skip_prepare_dataset": false, |
|
|
0: "special_tokens": { |
|
|
0: "bos_token": "<|im_start|>", |
|
|
0: "eos_token": "<|im_end|>", |
|
|
0: "pad_token": "<|endoftext|>" |
|
|
0: }, |
|
|
0: "strict": false, |
|
|
0: "tensor_parallel_size": 1, |
|
|
0: "tf32": false, |
|
|
0: "tiled_mlp_use_original_mlp": true, |
|
|
0: "tokenizer_config": "/lustre/fswork/projects/rech/qwv/udv55np/Qwen/Qwen2.5-3B_ift", |
|
|
0: "torch_dtype": "torch.bfloat16", |
|
|
0: "train_on_inputs": false, |
|
|
0: "trl": { |
|
|
0: "log_completions": false, |
|
|
0: "mask_truncated_completions": false, |
|
|
0: "ref_model_mixup_alpha": 0.9, |
|
|
0: "ref_model_sync_steps": 64, |
|
|
0: "scale_rewards": true, |
|
|
0: "sync_ref_model": false, |
|
|
0: "use_vllm": false, |
|
|
0: "vllm_server_host": "0.0.0.0", |
|
|
0: "vllm_server_port": 8000 |
|
|
0: }, |
|
|
0: "use_ray": false, |
|
|
0: "use_tensorboard": true, |
|
|
0: "val_set_size": 0.0, |
|
|
0: "vllm": { |
|
|
0: "device": "auto", |
|
|
0: "dtype": "auto", |
|
|
0: "gpu_memory_utilization": 0.9, |
|
|
0: "host": "0.0.0.0", |
|
|
0: "port": 8000 |
|
|
0: }, |
|
|
0: "warmup_steps": 150, |
|
|
0: "weight_decay": 0.0, |
|
|
0: "world_size": 16 |
|
|
0: } |
|
|
0: [2025-09-02 18:42:37,931] [INFO] [axolotl.cli.checks.check_user_token:35] [PID:1478787] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used. |
|
|
2: Tokenizing Prompts (num_proc=192): 100%|██████████| 321773/321773 [00:38<00:00, 8276.39 examples/s] |
|
|
2: Dropping Long Sequences (>16384) (num_proc=192): 100%|██████████| 321773/321773 [00:04<00:00, 67525.21 examples/s] |
|
|
2: Drop Samples with Zero Trainable Tokens (num_proc=192): 100%|██████████| 315947/315947 [00:05<00:00, 59605.93 examples/s] |
|
|
2: Add position_id column (Sample Packing) (num_proc=192): 100%|██████████| 315947/315947 [00:05<00:00, 54860.47 examples/s] |
|
|
2: Saving the dataset (31/192 shards): 16%|█▌        | 51026/315947 [00:01<04:36, 958.98 examples/s]
Saving the dataset (32/192 shards): 17%|ββ | 52672/315947 [00:01<04:34, 958.98 examples/s]
Saving the dataset (33/192 shards): 18%|ββ | 55964/315947 [00:01<04:31, 958.98 examples/s]
Saving the dataset (34/192 shards): 18%|ββ | 55964/315947 [00:01<04:31, 958.98 examples/s]
Saving the dataset (35/192 shards): 18%|ββ | 57610/315947 [00:01<04:29, 958.98 examples/s]
Saving the dataset (36/192 shards): 19%|ββ | 60902/315947 [00:01<04:25, 958.98 examples/s]
Saving the dataset (37/192 shards): 19%|ββ | 60902/315947 [00:01<04:25, 958.98 examples/s]
Saving the dataset (38/192 shards): 20%|ββ | 62548/315947 [00:01<04:24, 958.98 examples/s]
Saving the dataset (39/192 shards): 20%|ββ | |
|
|
2: 64194/315947 [00:01<04:22, 958.98 examples/s]
Saving the dataset (40/192 shards): 21%|ββ | 65840/315947 [00:01<04:20, 958.98 examples/s]
Saving the dataset (41/192 shards): 22%|βββ | 69132/315947 [00:01<04:17, 958.98 examples/s]
Saving the dataset (42/192 shards): 22%|βββ | 70778/315947 [00:01<04:15, 958.98 examples/s]
Saving the dataset (43/192 shards): 22%|βββ | 70778/315947 [00:01<04:15, 958.98 examples/s]
Saving the dataset (44/192 shards): 23%|βββ | 72424/315947 [00:01<04:13, 958.98 examples/s]
Saving the dataset (45/192 shards): 24%|βββ | 75716/315947 [00:01<04:10, 958.98 examples/s]
Saving the dataset (46/192 shards): 24%|βββ | 75716/315947 [00:01<04:10, 958.98 examples/s]
Saving the dataset (47/192 shards): 24%|βββ | 77362/315947 [00:01<04:08, 958.98 examples/s]
Saving the dataset (48/192 shards): 26%|βββ | 80654/315947 [00:01<04:05, 958.98 examples/s]
Saving the dataset (49/192 shards) |
|
|
2: : 26%|βββ | 80654/315947 [00:01<04:05, 958.98 examples/s]
Saving the dataset (50/192 shards): 26%|βββ | 82300/315947 [00:01<04:03, 958.98 examples/s]
Saving the dataset (51/192 shards): 27%|βββ | 83946/315947 [00:01<04:01, 958.98 examples/s]
Saving the dataset (52/192 shards): 27%|βββ | 85592/315947 [00:01<04:00, 958.98 examples/s]
Saving the dataset (53/192 shards): 28%|βββ | 87238/315947 [00:01<03:58, 958.98 examples/s]
Saving the dataset (54/192 shards): 28%|βββ | 88884/315947 [00:01<03:56, 958.98 examples/s]
Saving the dataset (55/192 shards): 29%|βββ | 90530/315947 [00:01<03:55, 958.98 examples/s]
Saving the dataset (56/192 shards): 30%|βββ | 93822/315947 [00:01<03:51, 958.98 examples/s]
Saving the dataset (57/192 shards): 30%|βββ | 95468/315947 [00:01<03:49, 958.98 examples/s]
Saving the dataset (58/192 shards): 30%|βββ | 95468/315947 [00:01<03:49, 958.98 examples/s]
Saving t |
|
|
2: he dataset (59/192 shards): 31%|ββββ | 98760/315947 [00:01<03:46, 958.98 examples/s]
Saving the dataset (60/192 shards): 31%|ββββ | 98760/315947 [00:01<03:46, 958.98 examples/s]
Saving the dataset (61/192 shards): 32%|ββββ | 100406/315947 [00:01<03:44, 958.98 examples/s]
Saving the dataset (62/192 shards): 32%|ββββ | 102052/315947 [00:01<03:43, 958.98 examples/s]
Saving the dataset (63/192 shards): 33%|ββββ | 103698/315947 [00:01<03:41, 958.98 examples/s]
Saving the dataset (64/192 shards): 33%|ββββ | 105344/315947 [00:01<03:39, 958.98 examples/s]
Saving the dataset (65/192 shards): 34%|ββββ | 108636/315947 [00:01<03:36, 958.98 examples/s]
Saving the dataset (66/192 shards): 34%|ββββ | 108636/315947 [00:01<03:36, 958.98 examples/s]
Saving the dataset (67/192 shards): 35%|ββββ | 110282/315947 [00:01<03:34, 958.98 examples/s]
Saving the dataset (68/192 shards): 35%|ββββ | 1 |
|
|
2: 11928/315947 [00:01<03:32, 958.98 examples/s]
Saving the dataset (69/192 shards): 36%|ββββ | 113574/315947 [00:01<03:31, 958.98 examples/s]
Saving the dataset (70/192 shards): 36%|ββββ | 115220/315947 [00:01<03:29, 958.98 examples/s]
Saving the dataset (71/192 shards): 37%|ββββ | 116866/315947 [00:01<03:27, 958.98 examples/s]
Saving the dataset (72/192 shards): 38%|ββββ | 118512/315947 [00:01<03:25, 958.98 examples/s]
Saving the dataset (73/192 shards): 38%|ββββ | 120158/315947 [00:01<03:24, 958.98 examples/s]
Saving the dataset (74/192 shards): 39%|ββββ | 121804/315947 [00:01<03:22, 958.98 examples/s]
Saving the dataset (75/192 shards): 39%|ββββ | 123450/315947 [00:01<03:20, 958.98 examples/s]
Saving the dataset (76/192 shards): 40%|ββββ | 125096/315947 [00:01<03:19, 958.98 examples/s]
Saving the dataset (77/192 shards): 40%|ββββ | 126742/315947 [00:01<03:17, 958.98 examples/s]
Saving |
|
|
2: the dataset (78/192 shards): 41%|ββββ | 130034/315947 [00:01<03:13, 958.98 examples/s]
Saving the dataset (79/192 shards): 42%|βββββ | 131680/315947 [00:01<03:12, 958.98 examples/s]
Saving the dataset (80/192 shards): 42%|βββββ | 131680/315947 [00:01<03:12, 958.98 examples/s]
Saving the dataset (81/192 shards): 42%|βββββ | 133326/315947 [00:01<03:10, 958.98 examples/s]
Saving the dataset (82/192 shards): 43%|βββββ | 134972/315947 [00:01<03:08, 958.98 examples/s]
Saving the dataset (83/192 shards): 43%|βββββ | 136618/315947 [00:01<03:07, 958.98 examples/s]
Saving the dataset (84/192 shards): 44%|βββββ | 138264/315947 [00:01<03:05, 958.98 examples/s]
Saving the dataset (85/192 shards): 44%|βββββ | 139910/315947 [00:01<03:03, 958.98 examples/s]
Saving the dataset (86/192 shards): 45%|βββββ | 141556/315947 [00:01<03:01, 958.98 examples/s]
Saving the dataset (87/192 shards): 45%|β |
|
|
2: ββββ | 143202/315947 [00:01<03:00, 958.98 examples/s]
Saving the dataset (88/192 shards): 46%|βββββ | 144848/315947 [00:01<02:58, 958.98 examples/s]
Saving the dataset (89/192 shards): 46%|βββββ | 146494/315947 [00:01<02:56, 958.98 examples/s]
Saving the dataset (90/192 shards): 47%|βββββ | 148140/315947 [00:01<02:54, 958.98 examples/s]
Saving the dataset (91/192 shards): 48%|βββββ | 151432/315947 [00:01<02:51, 958.98 examples/s]
Saving the dataset (92/192 shards): 48%|βββββ | 151432/315947 [00:01<02:51, 958.98 examples/s]
Saving the dataset (93/192 shards): 48%|βββββ | 153078/315947 [00:01<02:49, 958.98 examples/s]
Saving the dataset (94/192 shards): 49%|βββββ | 154724/315947 [00:01<02:48, 958.98 examples/s]
Saving the dataset (95/192 shards): 49%|βββββ | 156370/315947 [00:01<02:46, 958.98 examples/s]
Saving the dataset (96/192 shards): 50%|βββββ | 158016/315947 |
|
|
2: [00:01<02:44, 958.98 examples/s]
Saving the dataset (97/192 shards): 51%|βββββ | 159662/315947 [00:01<02:42, 958.98 examples/s]
Saving the dataset (98/192 shards): 51%|βββββ | 161308/315947 [00:01<02:41, 958.98 examples/s]
Saving the dataset (99/192 shards): 52%|ββββββ | 162954/315947 [00:01<02:39, 958.98 examples/s]
Saving the dataset (100/192 shards): 52%|ββββββ | 164600/315947 [00:01<02:37, 958.98 examples/s]
Saving the dataset (101/192 shards): 53%|ββββββ | 166246/315947 [00:01<02:36, 958.98 examples/s]
Saving the dataset (102/192 shards): 53%|ββββββ | 167892/315947 [00:01<02:34, 958.98 examples/s]
Saving the dataset (103/192 shards): 54%|ββββββ | 169538/315947 [00:01<02:32, 958.98 examples/s]
Saving the dataset (104/192 shards): 55%|ββββββ | 172830/315947 [00:01<02:29, 958.98 examples/s]
Saving the dataset (105/192 shards): 55%|ββββββ | 172830/315947 [00:01<02:29, |
|
|
2: 958.98 examples/s]
Saving the dataset (106/192 shards): 55%|ββββββ | 174476/315947 [00:01<02:27, 958.98 examples/s]
Saving the dataset (107/192 shards): 56%|ββββββ | 176122/315947 [00:01<02:25, 958.98 examples/s]
Saving the dataset (108/192 shards): 57%|ββββββ | 181057/315947 [00:01<02:20, 958.98 examples/s]
Saving the dataset (109/192 shards): 57%|ββββββ | 181057/315947 [00:01<02:20, 958.98 examples/s]
Saving the dataset (110/192 shards): 57%|ββββββ | 181057/315947 [00:01<02:20, 958.98 examples/s]
Saving the dataset (111/192 shards): 58%|ββββββ | 182702/315947 [00:01<02:18, 958.98 examples/s]
Saving the dataset (112/192 shards): 59%|ββββββ | 185992/315947 [00:01<02:15, 958.98 examples/s]
Saving the dataset (113/192 shards): 59%|ββββββ | 185992/315947 [00:01<02:15, 958.98 examples/s]
Saving the dataset (114/192 shards): 59%|ββββββ | 187637/315947 [00:01<02:13, 958.98 |
|
|
2: examples/s]
Saving the dataset (115/192 shards): 60%|ββββββ | 189282/315947 [00:01<02:12, 958.98 examples/s]
Saving the dataset (116/192 shards): 60%|ββββββ | 190927/315947 [00:01<02:10, 958.98 examples/s]
Saving the dataset (117/192 shards): 61%|ββββββ | 192572/315947 [00:01<02:08, 958.98 examples/s]
Saving the dataset (118/192 shards): 61%|βββββββ | 194217/315947 [00:01<02:06, 958.98 examples/s]
Saving the dataset (119/192 shards): 62%|βββββββ | 195862/315947 [00:01<02:05, 958.98 examples/s]
Saving the dataset (120/192 shards): 63%|βββββββ | 197507/315947 [00:01<02:03, 958.98 examples/s]
Saving the dataset (121/192 shards): 63%|βββββββ | 199152/315947 [00:01<02:01, 958.98 examples/s]
Saving the dataset (122/192 shards): 64%|βββββββ | 200797/315947 [00:01<02:00, 958.98 examples/s]
Saving the dataset (123/192 shards): 64%|βββββββ | 202442/315947 [00:01<01:58, 9 |
|
|
2: 58.98 examples/s]
Saving the dataset (124/192 shards): 65%|βββββββ | 205732/315947 [00:01<01:54, 958.98 examples/s]
Saving the dataset (125/192 shards): 66%|βββββββ | 207377/315947 [00:01<01:53, 958.98 examples/s]
Saving the dataset (126/192 shards): 66%|βββββββ | 207377/315947 [00:01<01:53, 958.98 examples/s]
Saving the dataset (127/192 shards): 67%|βββββββ | 210667/315947 [00:01<01:49, 958.98 examples/s]
Saving the dataset (128/192 shards): 67%|βββββββ | 210667/315947 [00:01<01:49, 958.98 examples/s]
Saving the dataset (129/192 shards): 67%|βββββββ | 212312/315947 [00:01<01:48, 958.98 examples/s]
Saving the dataset (130/192 shards): 68%|βββββββ | 213957/315947 [00:01<01:46, 958.98 examples/s]
Saving the dataset (131/192 shards): 69%|βββββββ | 217247/315947 [00:01<01:42, 958.98 examples/s]
Saving the dataset (132/192 shards): 69%|βββββββ | 217247/315947 [00: |
|
|
2: 01<01:42, 958.98 examples/s]
Saving the dataset (133/192 shards): 69%|βββββββ | 218892/315947 [00:01<01:41, 958.98 examples/s]
Saving the dataset (134/192 shards): 70%|βββββββ | 220537/315947 [00:01<01:39, 958.98 examples/s]
Saving the dataset (135/192 shards): 70%|βββββββ | 222182/315947 [00:01<01:37, 958.98 examples/s]
Saving the dataset (136/192 shards): 71%|βββββββ | 223827/315947 [00:01<01:36, 958.98 examples/s]
Saving the dataset (137/192 shards): 71%|ββββββββ | 225472/315947 [00:01<01:34, 958.98 examples/s]
Saving the dataset (138/192 shards): 72%|ββββββββ | 227117/315947 [00:01<01:32, 958.98 examples/s]
Saving the dataset (139/192 shards): 72%|ββββββββ | 228762/315947 [00:01<01:30, 958.98 examples/s]
Saving the dataset (140/192 shards): 73%|ββββββββ | 230407/315947 [00:01<01:29, 958.98 examples/s]
Saving the dataset (141/192 shards): 73%|ββββββββ |
|
|
2: | 232052/315947 [00:01<01:27, 958.98 examples/s]
Saving the dataset (142/192 shards): 74%|ββββββββ | 233697/315947 [00:01<01:25, 958.98 examples/s]
Saving the dataset (143/192 shards): 74%|ββββββββ | 235342/315947 [00:01<01:24, 958.98 examples/s]
Saving the dataset (144/192 shards): 75%|ββββββββ | 236987/315947 [00:01<01:22, 958.98 examples/s]
Saving the dataset (145/192 shards): 76%|ββββββββ | 238632/315947 [00:01<01:20, 958.98 examples/s]
Saving the dataset (146/192 shards): 76%|ββββββββ | 240277/315947 [00:01<01:18, 958.98 examples/s]
Saving the dataset (147/192 shards): 77%|ββββββββ | 243567/315947 [00:01<01:15, 958.98 examples/s]
Saving the dataset (148/192 shards): 77%|ββββββββ | 243567/315947 [00:01<01:15, 958.98 examples/s]
Saving the dataset (149/192 shards): 78%|ββββββββ | 245212/315947 [00:01<01:13, 958.98 examples/s]
Saving the dataset (150/192 shards): |
|
|
2: 78%|ββββββββ | 246857/315947 [00:01<01:12, 958.98 examples/s]
Saving the dataset (151/192 shards): 79%|ββββββββ | 248502/315947 [00:01<01:10, 958.98 examples/s]
Saving the dataset (152/192 shards): 80%|ββββββββ | 251792/315947 [00:01<01:06, 958.98 examples/s]
Saving the dataset (153/192 shards): 80%|ββββββββ | 251792/315947 [00:01<01:06, 958.98 examples/s]
Saving the dataset (154/192 shards): 80%|ββββββββ | 253437/315947 [00:01<01:05, 958.98 examples/s]
Saving the dataset (155/192 shards): 81%|ββββββββ | 255082/315947 [00:01<01:03, 958.98 examples/s]
Saving the dataset (156/192 shards): 82%|βββββββββ | 258372/315947 [00:01<01:00, 958.98 examples/s]
Saving the dataset (157/192 shards): 82%|βββββββββ | 258372/315947 [00:01<01:00, 958.98 examples/s]
Saving the dataset (158/192 shards): 82%|βββββββββ | 260017/315947 [00:01<00:58, 958.98 examples/s]
Sav |
|
|
2: ing the dataset (159/192 shards): 83%|βββββββββ | 261662/315947 [00:01<00:56, 958.98 examples/s]
Saving the dataset (160/192 shards): 83%|βββββββββ | 263307/315947 [00:01<00:54, 958.98 examples/s]
Saving the dataset (161/192 shards): 84%|βββββββββ | 264952/315947 [00:01<00:53, 958.98 examples/s]
Saving the dataset (162/192 shards): 84%|βββββββββ | 266597/315947 [00:01<00:51, 958.98 examples/s]
Saving the dataset (163/192 shards): 85%|βββββββββ | 268242/315947 [00:01<00:49, 958.98 examples/s]
Saving the dataset (164/192 shards): 85%|βββββββββ | 269887/315947 [00:01<00:48, 958.98 examples/s]
Saving the dataset (165/192 shards): 86%|βββββββββ | 273177/315947 [00:01<00:44, 958.98 examples/s]
Saving the dataset (166/192 shards): 86%|βββββββββ | 273177/315947 [00:01<00:44, 958.98 examples/s]
Saving the dataset (167/192 shards): 87%|βββββββββ | 274 |
|
|
2: 822/315947 [00:01<00:42, 958.98 examples/s]
Saving the dataset (168/192 shards): 88%|βββββββββ | 276467/315947 [00:01<00:41, 958.98 examples/s]
Saving the dataset (169/192 shards): 88%|βββββββββ | 278112/315947 [00:01<00:39, 958.98 examples/s]
Saving the dataset (170/192 shards): 89%|βββββββββ | 279757/315947 [00:01<00:37, 958.98 examples/s]
Saving the dataset (171/192 shards): 89%|βββββββββ | 281402/315947 [00:01<00:36, 958.98 examples/s]
Saving the dataset (172/192 shards): 90%|βββββββββ | 283047/315947 [00:01<00:34, 958.98 examples/s]
Saving the dataset (173/192 shards): 90%|βββββββββ | 284692/315947 [00:01<00:32, 958.98 examples/s]
Saving the dataset (174/192 shards): 91%|βββββββββ | 286337/315947 [00:01<00:30, 958.98 examples/s]
Saving the dataset (175/192 shards): 92%|ββββββββββ| 289627/315947 [00:01<00:27, 958.98 examples/s]
Saving the dataset (176/19 |
|
|
2: 2 shards): 92%|ββββββββββ| 289627/315947 [00:01<00:27, 958.98 examples/s]
Saving the dataset (177/192 shards): 93%|ββββββββββ| 292917/315947 [00:01<00:24, 958.98 examples/s]
Saving the dataset (178/192 shards): 93%|ββββββββββ| 292917/315947 [00:01<00:24, 958.98 examples/s]
Saving the dataset (179/192 shards): 93%|ββββββββββ| 294562/315947 [00:01<00:22, 958.98 examples/s]
Saving the dataset (180/192 shards): 94%|ββββββββββ| 296207/315947 [00:01<00:20, 958.98 examples/s]
Saving the dataset (181/192 shards): 94%|ββββββββββ| 297852/315947 [00:01<00:18, 958.98 examples/s]
Saving the dataset (182/192 shards): 95%|ββββββββββ| 301142/315947 [00:01<00:15, 958.98 examples/s]
Saving the dataset (183/192 shards): 95%|ββββββββββ| 301142/315947 [00:01<00:15, 958.98 examples/s]
Saving the dataset (184/192 shards): 96%|ββββββββββ| 304432/3 |
|
|
2: 15947 [00:01<00:12, 958.98 examples/s]
Saving the dataset (185/192 shards): 96%|ββββββββββ| 304432/315947 [00:01<00:12, 958.98 examples/s]
Saving the dataset (186/192 shards): 97%|ββββββββββ| 306077/315947 [00:01<00:10, 958.98 examples/s]
Saving the dataset (187/192 shards): 97%|ββββββββββ| 307722/315947 [00:01<00:08, 958.98 examples/s]
Saving the dataset (188/192 shards): 98%|ββββββββββ| 309367/315947 [00:01<00:06, 958.98 examples/s]
Saving the dataset (189/192 shards): 98%|ββββββββββ| 311012/315947 [00:01<00:05, 958.98 examples/s]
Saving the dataset (190/192 shards): 99%|ββββββββββ| 312657/315947 [00:01<00:03, 958.98 examples/s]
Saving the dataset (191/192 shards): 99%|ββββββββββ| 314302/315947 [00:01<00:01, 958.98 examples/s]
Saving the dataset (192/192 shards): 100%|ββββββββββ| 315947/315947 [00:01<00:00, 958.98 examples/s]
Saving the datase |
|
|
2: t (192/192 shards): 100%|ββββββββββ| 315947/315947 [00:01<00:00, 169075.72 examples/s] |
|
|
0: [2025-09-02 18:43:32,409] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:472] [PID:1478787] [RANK:0] Loading prepared dataset from disk at /lustre/fsn1/projects/rech/dgo/udv55np/dataset_math/Qwen3-235B-A22B/0/b1771c7e92212c2fb90b5a0bac7a225c... |
|
|
0: [2025-09-02 18:45:12,633] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:1478787] [RANK:0] gather_len_batches: [25939, 25939, 25938, 25939, 25939, 25938, 25940, 25938, 25939, 25939, 25939, 25937, 25939, 25940, 25939, 25938] |
|
|
0: [2025-09-02 18:45:12,660] [INFO] [axolotl.utils.trainer.calc_sample_packing_eff_est:495] [PID:1478787] [RANK:0] sample_packing_eff_est across ranks: [0.9965550303459167, 0.9965550303459167, 0.9964781999588013, 0.9965550303459167, 0.9965166449546814, 0.9965934753417969, 0.9965550303459167, 0.9965166449546814, 0.9965934753417969, 0.9965550303459167, 0.9965934753417969, 0.9965934753417969, 0.9965550303459167, 0.9965550303459167, 0.9965934753417969, 0.9965934753417969] |
|
|
0: [2025-09-02 18:45:12,665] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:127] [PID:1478787] [RANK:0] Maximum number of steps set at 1621 |
|
|
0: [2025-09-02 18:45:12,990] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:110] [PID:1478787] [RANK:0] Patched Trainer.evaluation_loop with nanmean loss calculation |
|
|
0: [2025-09-02 18:45:12,991] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:164] [PID:1478787] [RANK:0] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation |
|
|
0:
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.97s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.97s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.97s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:15<00:15, 15.06s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it] |
|
|
0:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
3:
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.99s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.99s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.98s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.99s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.52s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.52s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.52s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.64s/it] |
|
|
1:
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.97s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.97s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.97s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:14<00:14, 14.98s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
2:
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:15<00:15, 15.02s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:15<00:15, 15.02s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:15<00:15, 15.01s/it]
Loading checkpoint shards: 50%|βββββ | 1/2 [00:15<00:15, 15.01s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.53s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.65s/it] |
|
|
0:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
3:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.64s/it] |
|
|
1:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
1:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
2:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.53s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.53s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.65s/it] |
|
|
3:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.64s/it] |
|
|
2:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.65s/it] |
|
|
1:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
3:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.51s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.63s/it] |
|
|
2:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.53s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.65s/it] |
|
|
0:
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 7.55s/it]
Loading checkpoint shards: 100%|ββββββββββ| 2/2 [00:17<00:00, 8.67s/it] |
|
|
0: [2025-09-02 18:45:30,643] [INFO] [axolotl.loaders.model._configure_embedding_dtypes:345] [PID:1478787] [RANK:0] Converting modules to torch.bfloat16 |
|
|
0: [2025-09-02 18:45:39,599] [INFO] [axolotl.train.save_initial_configs:416] [PID:1478787] [RANK:0] Pre-saving tokenizer to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0... |
|
|
0: [2025-09-02 18:45:39,769] [INFO] [axolotl.train.save_initial_configs:419] [PID:1478787] [RANK:0] Pre-saving model config to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0... |
|
|
0: [2025-09-02 18:45:39,777] [INFO] [axolotl.train.execute_training:203] [PID:1478787] [RANK:0] Starting trainer... |
|
|
0: [2025-09-02 18:47:46,046] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:1478787] [RANK:0] gather_len_batches: [25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939] |
|
|
0: Parameter Offload - Persistent parameters statistics: param_count = 181, numel = 241664 |
|
|
0: {'loss': 0.293, 'grad_norm': 0.3438611558851885, 'learning_rate': 7.7e-07, 'memory/max_mem_active(gib)': 35.16, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 41.77, 'epoch': 0.01} |
|
|
0:
0%| | 0/1621 [00:00<?, ?it/s]
0%| | 1/1621 [03:16<88:16:18, 196.16s/it]
0%| | 2/1621 [03:18<37:02:55, 82.38s/it]
0%| | 3/1621 [03:20<20:24:21, 45.40s/it]
0%| | 4/1621 [03:21<12:35:47, 28.04s/it]
0%| | 5/1621 [03:23<8:16:55, 18.45s/it]
0%| | 6/1621 [03:24<5:40:48, 12.66s/it]
0%| | 7/1621 [03:26<4:01:45, 8.99s/it]
0%| | 8/1621 [03:27<2:58:05, 6.62s/it]
1%| | 9/1621 [03:29<2:14:25, 5.00s/it]
1%| | 10/1621 [03:30<1:45:26, 3.93s/it]
1%| | 10/1621 [03:30<1:45:26, 3.93s/it]
1%| | 11/1621 [03:31<1:25:08, 3.17s/it]
1%| | 12/1621 [03:33<1:10:52, 2.64s/it]
1%| | 13/1621 [03:34<1:01:40, 2.30s/it]
1%| | 14/1621 [03:36<54:29, 2.03s/it]
1%| | 15/1621 [03:37<49:26, 1.85s/it]
1%| | 16/1621 [03:39<46:23, 1.73s/it]
1%| | 17/1621 [03:40<43:46, 1.64s/it]
|
|
|
0: {'loss': 0.2902, 'grad_norm': 0.32969781245683594, 'learning_rate': 1.0700000000000001e-06, 'memory/max_mem_active(gib)': 35.16, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 41.77, 'epoch': 0.01} |
|
|
0: {'loss': 0.2916, 'grad_norm': 0.2957381320197924, 'learning_rate': 1.3700000000000002e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.02} |
|
|
0: 1%| | 18/1621 [03:42<42:43, 1.60s/it]
1%| | 19/1621 [03:43<42:48, 1.60s/it]
1%| | 20/1621 [03:45<41:24, 1.55s/it]
1%| | 20/1621 [03:45<41:24, 1.55s/it]
1%|β | 21/1621 [03:47<43:38, 1.64s/it]
1%|β | 22/1621 [03:48<41:53, 1.57s/it]
1%|β | 23/1621 [03:49<40:35, 1.52s/it]
1%|β | 24/1621 [03:51<39:54, 1.50s/it]
2%|β | 25/1621 [03:52<39:20, 1.48s/it]
2%|β | 26/1621 [03:54<38:55, 1.46s/it]
2%|β | 27/1621 [03:55<38:35, 1.45s/it]
2%|β | 28/1621 [03:57<40:41, 1.53s/it]
2%|β | 29/1621 [03:58<39:43, 1.50s/it]
2%|β | 30/1621 [04:00<40:17, 1.52s/it]
2%|β | 30/1621 [04:00<40:17, 1.52s/it]
2%|β | 31/1621 [04:01<40:17, 1.52s/it]
2%|β | 32/1621 [04:03<39:46, 1.50s/it]
2%|β | 33/1621 [04:04<39:13, 1.48s |
|
|
0: {'loss': 0.2879, 'grad_norm': 0.7741302157876179, 'learning_rate': 1.6700000000000003e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.02} |
|
|
0: /it]
2%|β | 34/1621 [04:06<39:51, 1.51s/it]
2%|β | 35/1621 [04:07<39:09, 1.48s/it]
2%|β | 36/1621 [04:09<38:39, 1.46s/it]
2%|β | 37/1621 [04:10<38:28, 1.46s/it]
2%|β | 38/1621 [04:12<38:11, 1.45s/it]
2%|β | 39/1621 [04:13<38:00, 1.44s/it]
2%|β | 40/1621 [04:14<38:20, 1.45s/it]
2%|β | 40/1621 [04:14<38:20, 1.45s/it]
3%|β | 41/1621 [04:16<38:05, 1.45s/it]
3%|β | 42/1621 [04:17<38:07, 1.45s/it]
3%|β | 43/1621 [04:19<38:02, 1.45s/it]
3%|β | 44/1621 [04:20<38:23, 1.46s/it]
3%|β | 45/1621 [04:22<40:59, 1.56s/it]
3%|β | 46/1621 [04:24<40:34, 1.55s/it]
3%|β | 47/1621 [04:25<40:17, 1.54s/it]
3%|β | 48/1621 [04:26<39:20, 1.50s/it]
3%|β | 49/1621 [04:28<38:39, 1.48s/it]
3%|β | 50/1621 [04:29<38:14, 1.46s/it]
|
|
|
0: {'loss': 0.2893, 'grad_norm': 0.30490400114093213, 'learning_rate': 1.97e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.03} |
|
|
0: {'loss': 0.2882, 'grad_norm': 0.299265789503041, 'learning_rate': 2.27e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.04} |
|
|
0:
3%|β | 50/1621 [04:29<38:14, 1.46s/it]
3%|β | 51/1621 [04:31<37:56, 1.45s/it]
3%|β | 52/1621 [04:32<37:40, 1.44s/it]
3%|β | 53/1621 [04:34<40:10, 1.54s/it]
3%|β | 54/1621 [04:35<39:27, 1.51s/it]
3%|β | 55/1621 [04:37<38:36, 1.48s/it]
3%|β | 56/1621 [04:38<38:04, 1.46s/it]
4%|β | 57/1621 [04:40<37:45, 1.45s/it]
4%|β | 58/1621 [04:41<37:26, 1.44s/it]
4%|β | 59/1621 [04:42<37:16, 1.43s/it]
4%|β | 60/1621 [04:44<37:05, 1.43s/it]
4%|β | 60/1621 [04:44<37:05, 1.43s/it]
4%|β | 61/1621 [04:45<37:02, 1.42s/it]
4%|β | 62/1621 [04:47<36:56, 1.42s/it]
4%|β | 63/1621 [04:48<36:46, 1.42s/it]
4%|β | 64/1621 [04:50<36:50, 1.42s/it]
4%|β | 65/1621 [04:51<36:52, 1.42s/it]
4%|β | 66/1621 [04:52<36:47, 1.42s/it]
4%|β | |
|
|
0: {'loss': 0.2852, 'grad_norm': 0.3075656415183605, 'learning_rate': 2.5700000000000004e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.04} |
|
|
0: {'loss': 0.2809, 'grad_norm': 0.2915611347823235, 'learning_rate': 2.87e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.05} |
|
|
0: 67/1621 [04:54<36:53, 1.42s/it]
4%|β | 68/1621 [04:55<36:52, 1.42s/it]
4%|β | 69/1621 [04:57<36:55, 1.43s/it]
4%|β | 70/1621 [04:58<37:38, 1.46s/it]
4%|β | 70/1621 [04:58<37:38, 1.46s/it]
4%|β | 71/1621 [05:00<37:59, 1.47s/it]
4%|β | 72/1621 [05:01<37:34, 1.46s/it]
5%|β | 73/1621 [05:03<37:12, 1.44s/it]
5%|β | 74/1621 [05:04<37:25, 1.45s/it]
5%|β | 75/1621 [05:05<37:39, 1.46s/it]
5%|β | 76/1621 [05:07<37:20, 1.45s/it]
5%|β | 77/1621 [05:08<38:22, 1.49s/it]
5%|β | 78/1621 [05:10<37:41, 1.47s/it]
5%|β | 79/1621 [05:11<37:28, 1.46s/it]
5%|β | 80/1621 [05:13<37:56, 1.48s/it]
5%|β | 80/1621 [05:13<37:56, 1.48s/it]
5%|β | 81/1621 [05:14<37:23, 1.46s/it]
5%|β | 82/1621 [05:16<37:36, 1.47s/it]
5% |
|
|
0: {'loss': 0.2866, 'grad_norm': 0.3188606948350198, 'learning_rate': 3.17e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.06} |
|
|
0: |β | 83/1621 [05:17<37:10, 1.45s/it]
5%|β | 84/1621 [05:19<36:55, 1.44s/it]
5%|β | 85/1621 [05:20<36:49, 1.44s/it]
5%|β | 86/1621 [05:21<36:38, 1.43s/it]
5%|β | 87/1621 [05:23<36:44, 1.44s/it]
5%|β | 88/1621 [05:24<36:29, 1.43s/it]
5%|β | 89/1621 [05:26<36:19, 1.42s/it]
6%|β | 90/1621 [05:27<36:49, 1.44s/it]
6%|β | 90/1621 [05:27<36:49, 1.44s/it]
6%|β | 91/1621 [05:29<36:33, 1.43s/it]
6%|β | 92/1621 [05:30<36:45, 1.44s/it]
6%|β | 93/1621 [05:32<37:16, 1.46s/it]
6%|β | 94/1621 [05:33<36:47, 1.45s/it]
6%|β | 95/1621 [05:34<36:28, 1.43s/it]
6%|β | 96/1621 [05:36<36:20, 1.43s/it]
6%|β | 97/1621 [05:38<38:52, 1.53s/it]
6%|β | 98/1621 [05:39<38:14, 1.51s/it]
6%|β | 99/1621 [05:40<37:28, 1.48s/it]
6%|β | 100/1621 [05:42<36:55, |
|
|
0: {'loss': 0.2892, 'grad_norm': 0.3196600090952391, 'learning_rate': 3.4700000000000007e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.06} |
|
|
0: {'loss': 0.2886, 'grad_norm': 0.3064739256293986, 'learning_rate': 3.7700000000000003e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.07} |
|
|
0: 1.46s/it]
6%|β | 100/1621 [05:42<36:55, 1.46s/it]
6%|β | 101/1621 [05:43<37:06, 1.46s/it]
6%|β | 102/1621 [05:45<36:37, 1.45s/it]
6%|β | 103/1621 [05:46<36:27, 1.44s/it]
6%|β | 104/1621 [05:48<36:37, 1.45s/it]
6%|β | 105/1621 [05:49<36:16, 1.44s/it]
7%|β | 106/1621 [05:50<36:26, 1.44s/it]
7%|β | 107/1621 [05:52<36:24, 1.44s/it]
7%|β | 108/1621 [05:53<36:24, 1.44s/it]
7%|β | 109/1621 [05:55<36:33, 1.45s/it]
7%|β | 110/1621 [05:56<36:19, 1.44s/it]
7%|β | 110/1621 [05:56<36:19, 1.44s/it]
7%|β | 111/1621 [05:58<36:14, 1.44s/it]
7%|β | 112/1621 [05:59<35:59, 1.43s/it]
7%|β | 113/1621 [06:01<36:09, 1.44s/it]
7%|β | 114/1621 [06:02<36:03, 1.44s/it]
7%|β | 115/1621 [06:03<35:50, 1.43s/it]
7%|οΏ½ |
|
|
0: {'loss': 0.2892, 'grad_norm': 0.4347169189893856, 'learning_rate': 4.07e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.07} |
|
|
0: {'loss': 0.2864, 'grad_norm': 0.2939409486111771, 'learning_rate': 4.3700000000000005e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.08} |
|
|
0: οΏ½ | 116/1621 [06:05<35:58, 1.43s/it]
7%|β | 117/1621 [06:06<35:45, 1.43s/it]
7%|β | 118/1621 [06:08<37:32, 1.50s/it]
7%|β | 119/1621 [06:09<36:53, 1.47s/it]
7%|β | 120/1621 [06:11<39:52, 1.59s/it]
7%|β | 120/1621 [06:11<39:52, 1.59s/it]
7%|β | 121/1621 [06:13<38:33, 1.54s/it]
8%|β | 122/1621 [06:14<38:12, 1.53s/it]
8%|β | 123/1621 [06:16<37:29, 1.50s/it]
8%|β | 124/1621 [06:17<38:19, 1.54s/it]
8%|β | 125/1621 [06:19<37:40, 1.51s/it]
8%|β | 126/1621 [06:20<36:54, 1.48s/it]
8%|β | 127/1621 [06:22<37:21, 1.50s/it]
8%|β | 128/1621 [06:23<36:45, 1.48s/it]
8%|β | 129/1621 [06:24<36:20, 1.46s/it]
8%|β | 130/1621 [06:26<36:32, 1.47s/it]
8%|β | 130/1621 [06:26<36:32, 1.47s/it]
8%|β | 131/162 |
|
|
0: {'loss': 0.281, 'grad_norm': 0.3037682241274547, 'learning_rate': 4.67e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.09} |
|
|
0: 1 [06:28<38:40, 1.56s/it]
8%|β | 132/1621 [06:29<38:01, 1.53s/it]
8%|β | 133/1621 [06:31<37:03, 1.49s/it]
8%|β | 134/1621 [06:32<36:20, 1.47s/it]
8%|β | 135/1621 [06:33<35:58, 1.45s/it]
8%|β | 136/1621 [06:35<35:44, 1.44s/it]
8%|β | 137/1621 [06:36<35:36, 1.44s/it]
9%|β | 138/1621 [06:38<35:34, 1.44s/it]
9%|β | 139/1621 [06:39<35:27, 1.44s/it]
9%|β | 140/1621 [06:41<36:24, 1.47s/it]
9%|β | 140/1621 [06:41<36:24, 1.47s/it]
9%|β | 141/1621 [06:42<36:13, 1.47s/it]
9%|β | 142/1621 [06:44<35:42, 1.45s/it]
9%|β | 143/1621 [06:45<35:25, 1.44s/it]
9%|β | 144/1621 [06:46<35:41, 1.45s/it]
9%|β | 145/1621 [06:48<35:31, 1.44s/it]
9%|β | 146/1621 [06:49<35:13, 1.43s/it]
9%|β | 147/1621 [06:51<35:00, 1.43s/it]
9%|β | 148/1621 [06:52<34:49, 1 |
|
|
0: {'loss': 0.2872, 'grad_norm': 0.31896588075238047, 'learning_rate': 4.970000000000001e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.09} |
|
|
0: {'loss': 0.2745, 'grad_norm': 0.3079236544412073, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.1} |
|
|
0: .42s/it]
9%|β | 149/1621 [06:54<35:19, 1.44s/it]
9%|β | 150/1621 [06:55<35:03, 1.43s/it]
9%|β | 150/1621 [06:55<35:03, 1.43s/it]
9%|β | 151/1621 [06:57<36:42, 1.50s/it]
9%|β | 152/1621 [06:58<36:43, 1.50s/it]
9%|β | 153/1621 [07:00<38:25, 1.57s/it]
10%|β | 154/1621 [07:01<37:12, 1.52s/it]
10%|β | 155/1621 [07:03<37:08, 1.52s/it]
10%|β | 156/1621 [07:04<36:52, 1.51s/it]
10%|β | 157/1621 [07:06<36:39, 1.50s/it]
10%|β | 158/1621 [07:07<36:16, 1.49s/it]
10%|β | 159/1621 [07:09<35:54, 1.47s/it]
10%|β | 160/1621 [07:10<37:48, 1.55s/it]
10%|β | 160/1621 [07:10<37:48, 1.55s/it]
10%|β | 161/1621 [07:12<37:09, 1.53s/it]
10%|β | 162/1621 [07:13<36:24, 1.50s/it]
10%|β | 163/1621 [07:15<35:42, 1.47s/it]
10%|β |
|
|
0: {'loss': 0.2822, 'grad_norm': 0.30913370447364324, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.1} |
|
|
0: | 164/1621 [07:16<35:13, 1.45s/it]
10%|β | 165/1621 [07:18<35:36, 1.47s/it]
10%|β | 166/1621 [07:19<35:31, 1.46s/it]
10%|β | 167/1621 [07:21<35:11, 1.45s/it]
10%|β | 168/1621 [07:22<35:03, 1.45s/it]
10%|β | 169/1621 [07:23<34:49, 1.44s/it]
10%|β | 170/1621 [07:25<34:34, 1.43s/it]
10%|β | 170/1621 [07:25<34:34, 1.43s/it]
11%|β | 171/1621 [07:26<34:39, 1.43s/it]
11%|β | 172/1621 [07:28<34:59, 1.45s/it]
11%|β | 173/1621 [07:29<34:44, 1.44s/it]
11%|β | 174/1621 [07:31<34:48, 1.44s/it]
11%|β | 175/1621 [07:32<34:35, 1.44s/it]
11%|β | 176/1621 [07:34<36:14, 1.50s/it]
11%|β | 177/1621 [07:35<36:48, 1.53s/it]
11%|β | 178/1621 [07:37<37:07, 1.54s/it]
11%|β | 179/1621 [07:38<36:20, 1.51s/it]
11%|β | 180/1621 [07:40<35:35, 1.48s/it]
|
|
|
0: {'loss': 0.2757, 'grad_norm': 0.30350079506000416, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.11} |
|
|
0: {'loss': 0.2709, 'grad_norm': 0.31655651271681584, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.12} |
|
|
0:
11%|β | 180/1621 [07:40<35:35, 1.48s/it]
11%|β | 181/1621 [07:41<37:12, 1.55s/it]
11%|β | 182/1621 [07:43<36:18, 1.51s/it]
11%|ββ | 183/1621 [07:44<35:33, 1.48s/it]
11%|ββ | 184/1621 [07:46<35:08, 1.47s/it]
11%|ββ | 185/1621 [07:47<35:46, 1.49s/it]
11%|ββ | 186/1621 [07:49<36:02, 1.51s/it]
12%|ββ | 187/1621 [07:50<36:06, 1.51s/it]
12%|ββ | 188/1621 [07:52<35:49, 1.50s/it]
12%|ββ | 189/1621 [07:53<35:16, 1.48s/it]
12%|ββ | 190/1621 [07:55<35:37, 1.49s/it]
12%|ββ | 190/1621 [07:55<35:37, 1.49s/it]
12%|ββ | 191/1621 [07:56<36:04, 1.51s/it]
12%|ββ | 192/1621 [07:58<36:16, 1.52s/it]
12%|ββ | 193/1621 [07:59<37:00, 1.56s/it]
12%|ββ | 194/1621 [08:01<36:01, 1.51s/it]
12%|ββ | 195/1621 [08:03<37:00, 1.56s/it]
12%|ββ |
|
|
0: {'loss': 0.2814, 'grad_norm': 0.3044835281174804, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.12} |
|
|
0: {'loss': 0.2749, 'grad_norm': 0.3002036469553508, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.13} |
|
|
0: | 196/1621 [08:04<35:52, 1.51s/it]
12%|ββ | 197/1621 [08:05<35:16, 1.49s/it]
12%|ββ | 198/1621 [08:07<34:42, 1.46s/it]
12%|ββ | 199/1621 [08:08<34:24, 1.45s/it]
12%|ββ | 200/1621 [08:10<34:04, 1.44s/it]
12%|ββ | 200/1621 [08:10<34:04, 1.44s/it]
12%|ββ | 201/1621 [08:11<33:54, 1.43s/it]
12%|ββ | 202/1621 [08:12<33:39, 1.42s/it]
13%|ββ | 203/1621 [08:14<33:31, 1.42s/it]
13%|ββ | 204/1621 [08:15<34:04, 1.44s/it]
13%|ββ | 205/1621 [08:17<34:24, 1.46s/it]
13%|ββ | 206/1621 [08:18<34:14, 1.45s/it]
13%|ββ | 207/1621 [08:20<33:53, 1.44s/it]
13%|ββ | 208/1621 [08:21<34:09, 1.45s/it]
13%|ββ | 209/1621 [08:23<34:02, 1.45s/it]
13%|ββ | 210/1621 [08:24<34:30, 1.47s/it]
13%|ββ | 210/1621 [08:24<34:30, 1.47s/it]
13% |
|
|
0: {'loss': 0.2797, 'grad_norm': 0.2967747880428994, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.14} |
|
|
0: |ββ | 211/1621 [08:26<34:10, 1.45s/it]
13%|ββ | 212/1621 [08:27<33:48, 1.44s/it]
13%|ββ | 213/1621 [08:28<33:37, 1.43s/it]
13%|ββ | 214/1621 [08:30<33:32, 1.43s/it]
13%|ββ | 215/1621 [08:31<33:21, 1.42s/it]
13%|ββ | 216/1621 [08:33<34:03, 1.45s/it]
13%|ββ | 217/1621 [08:34<33:43, 1.44s/it]
13%|ββ | 218/1621 [08:36<33:46, 1.44s/it]
14%|ββ | 219/1621 [08:37<33:29, 1.43s/it]
14%|ββ | 220/1621 [08:39<35:14, 1.51s/it]
14%|ββ | 220/1621 [08:39<35:14, 1.51s/it]
14%|ββ | 221/1621 [08:40<34:47, 1.49s/it]
14%|ββ | 222/1621 [08:42<34:32, 1.48s/it]
14%|ββ | 223/1621 [08:43<34:02, 1.46s/it]
14%|ββ | 224/1621 [08:44<34:29, 1.48s/it]
14%|ββ | 225/1621 [08:46<34:08, 1.47s/it]
14%|ββ | 226/1621 [08:47<33:39, 1.45s/it]
14%|ββ | 227/1621 [08:49<34:0 |
|
|
0: {'loss': 0.267, 'grad_norm': 0.31143673730927085, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.14} |
|
|
0: {'loss': 0.2743, 'grad_norm': 0.3155526747255406, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.15} |
|
|
0: 2, 1.47s/it]
14%|ββ | 228/1621 [08:50<34:07, 1.47s/it]
14%|ββ | 229/1621 [08:52<33:45, 1.46s/it]
14%|ββ | 230/1621 [08:53<33:39, 1.45s/it]
14%|ββ | 230/1621 [08:53<33:39, 1.45s/it]
14%|ββ | 231/1621 [08:55<33:31, 1.45s/it]
14%|ββ | 232/1621 [08:56<33:29, 1.45s/it]
14%|ββ | 233/1621 [08:58<33:29, 1.45s/it]
14%|ββ | 234/1621 [08:59<33:17, 1.44s/it]
14%|ββ | 235/1621 [09:00<33:31, 1.45s/it]
15%|ββ | 236/1621 [09:02<33:28, 1.45s/it]
15%|ββ | 237/1621 [09:03<33:31, 1.45s/it]
15%|ββ | 238/1621 [09:05<33:07, 1.44s/it]
15%|ββ | 239/1621 [09:06<33:20, 1.45s/it]
15%|ββ | 240/1621 [09:08<33:17, 1.45s/it]
15%|ββ | 240/1621 [09:08<33:17, 1.45s/it]
15%|ββ | 241/1621 [09:09<33:16, 1.45s/it]
15%|ββ | 242/1 |
|
|
0: {'loss': 0.2788, 'grad_norm': 0.3136432802461907, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.15} |
|
|
0: 621 [09:11<33:18, 1.45s/it]
15%|ββ | 243/1621 [09:12<33:03, 1.44s/it]
15%|ββ | 244/1621 [09:14<35:31, 1.55s/it]
15%|ββ | 245/1621 [09:15<34:31, 1.51s/it]
15%|ββ | 246/1621 [09:17<34:07, 1.49s/it]
15%|ββ | 247/1621 [09:18<33:36, 1.47s/it]
15%|ββ | 248/1621 [09:19<33:11, 1.45s/it]
15%|ββ | 249/1621 [09:21<33:12, 1.45s/it]
15%|ββ | 250/1621 [09:22<33:00, 1.44s/it]
15%|ββ | 250/1621 [09:22<33:00, 1.44s/it]
15%|ββ | 251/1621 [09:24<33:01, 1.45s/it]
16%|ββ | 252/1621 [09:25<32:56, 1.44s/it]
16%|ββ | 253/1621 [09:27<33:06, 1.45s/it]
16%|ββ | 254/1621 [09:28<33:39, 1.48s/it]
16%|ββ | 255/1621 [09:30<33:17, 1.46s/it]
16%|ββ | 256/1621 [09:31<33:37, 1.48s/it]
16%|ββ | 257/1621 [09:33<33:18, 1.46s/it]
16%|ββ | 258/1621 [09:34<32:51, 1.45s/it]
16%|β |
|
|
0: {'loss': 0.2809, 'grad_norm': 0.3423992975218093, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.16} |
|
|
0: {'loss': 0.2701, 'grad_norm': 0.3288994173746047, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.17} |
|
|
0: β | 259/1621 [09:35<32:43, 1.44s/it]
16%|ββ | 260/1621 [09:37<32:41, 1.44s/it]
16%|ββ | 260/1621 [09:37<32:41, 1.44s/it]
16%|ββ | 261/1621 [09:38<32:38, 1.44s/it]
16%|ββ | 262/1621 [09:40<35:34, 1.57s/it]
16%|ββ | 263/1621 [09:42<34:53, 1.54s/it]
16%|ββ | 264/1621 [09:43<34:05, 1.51s/it]
16%|ββ | 265/1621 [09:45<33:58, 1.50s/it]
16%|ββ | 266/1621 [09:46<33:35, 1.49s/it]
16%|ββ | 267/1621 [09:47<32:58, 1.46s/it]
17%|ββ | 268/1621 [09:49<32:32, 1.44s/it]
17%|ββ | 269/1621 [09:50<33:35, 1.49s/it]
17%|ββ | 270/1621 [09:52<33:04, 1.47s/it]
17%|ββ | 270/1621 [09:52<33:04, 1.47s/it]
17%|ββ | 271/1621 [09:53<32:41, 1.45s/it]
17%|ββ | 272/1621 [09:55<34:13, 1.52s/it]
17%|ββ | 273/1621 [09:56<33:25, 1.4 |
|
|
0: {'loss': 0.2807, 'grad_norm': 0.29274925430224524, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.17} |
|
|
0: 9s/it]
17%|ββ | 274/1621 [09:58<32:57, 1.47s/it]
17%|ββ | 275/1621 [09:59<32:34, 1.45s/it]
17%|ββ | 276/1621 [10:01<32:38, 1.46s/it]
17%|ββ | 277/1621 [10:02<32:33, 1.45s/it]
17%|ββ | 278/1621 [10:04<32:28, 1.45s/it]
17%|ββ | 279/1621 [10:05<32:07, 1.44s/it]
17%|ββ | 280/1621 [10:06<31:50, 1.42s/it]
17%|ββ | 280/1621 [10:06<31:50, 1.42s/it]
17%|ββ | 281/1621 [10:08<31:48, 1.42s/it]
17%|ββ | 282/1621 [10:09<31:38, 1.42s/it]
17%|ββ | 283/1621 [10:11<32:33, 1.46s/it]
18%|ββ | 284/1621 [10:12<32:10, 1.44s/it]
18%|ββ | 285/1621 [10:14<32:00, 1.44s/it]
18%|ββ | 286/1621 [10:15<33:05, 1.49s/it]
18%|ββ | 287/1621 [10:17<32:53, 1.48s/it]
18%|ββ | 288/1621 [10:18<32:42, 1.47s/it]
18%|ββ | 289/1621 [10:19<32:14, 1.45s/it]
18%|ββ | 290/1621 |
|
|
0: {'loss': 0.2773, 'grad_norm': 0.31838982571156305, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.18} |
|
|
0: {'loss': 0.278, 'grad_norm': 0.32447615695347176, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.19} |
|
|
0: [10:21<32:04, 1.45s/it]
18%|ββ | 290/1621 [10:21<32:04, 1.45s/it]
18%|ββ | 291/1621 [10:23<33:39, 1.52s/it]
18%|ββ | 292/1621 [10:24<32:57, 1.49s/it]
18%|ββ | 293/1621 [10:26<33:42, 1.52s/it]
18%|ββ | 294/1621 [10:27<33:04, 1.50s/it]
18%|ββ | 295/1621 [10:29<34:11, 1.55s/it]
18%|ββ | 296/1621 [10:30<33:22, 1.51s/it]
18%|ββ | 297/1621 [10:32<32:50, 1.49s/it]
18%|ββ | 298/1621 [10:33<32:43, 1.48s/it]
18%|ββ | 299/1621 [10:34<32:19, 1.47s/it]
19%|ββ | 300/1621 [10:36<32:09, 1.46s/it]
19%|ββ | 300/1621 [10:36<32:09, 1.46s/it]
19%|ββ | 301/1621 [10:37<31:51, 1.45s/it]
19%|ββ | 302/1621 [10:39<31:35, 1.44s/it]
19%|ββ | 303/1621 [10:40<31:23, 1.43s/it]
19%|ββ | 304/1621 [10:42<31:30, 1.44s/it]
19%|ββ |
|
|
0: {'loss': 0.2753, 'grad_norm': 0.342886066749926, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.19} |
|
|
0: {'loss': 0.2776, 'grad_norm': 0.30166316287017786, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.2} |
|
|
0: | 305/1621 [10:43<32:15, 1.47s/it]
19%|ββ | 306/1621 [10:45<31:50, 1.45s/it]
19%|ββ | 307/1621 [10:46<31:42, 1.45s/it]
19%|ββ | 308/1621 [10:47<31:42, 1.45s/it]
19%|ββ | 309/1621 [10:49<31:25, 1.44s/it]
19%|ββ | 310/1621 [10:50<31:20, 1.43s/it]
19%|ββ | 310/1621 [10:50<31:20, 1.43s/it]
19%|ββ | 311/1621 [10:52<31:25, 1.44s/it]
19%|ββ | 312/1621 [10:53<31:24, 1.44s/it]
19%|ββ | 313/1621 [10:55<32:04, 1.47s/it]
19%|ββ | 314/1621 [10:56<32:53, 1.51s/it]
19%|ββ | 315/1621 [10:58<32:18, 1.48s/it]
19%|ββ | 316/1621 [10:59<31:48, 1.46s/it]
20%|ββ | 317/1621 [11:01<31:27, 1.45s/it]
20%|ββ | 318/1621 [11:02<32:08, 1.48s/it]
20%|ββ | 319/1621 [11:04<31:47, 1.47s/it]
20%|ββ | 320/1621 [11:05<31:25, 1.45s/it]
|
|
|
0: [2025-09-02 18:59:03,462] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-325 |
|
|
0: [2025-09-02 18:59:08,211] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None` |
|
|
0: {'loss': 0.2799, 'grad_norm': 0.3399264182401738, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.2} |
|
|
0: 20%|ββ | 320/1621 [11:05<31:25, 1.45s/it]
20%|ββ | 321/1621 [11:06<31:13, 1.44s/it]
20%|ββ | 322/1621 [11:08<35:05, 1.62s/it]
20%|ββ | 323/1621 [11:10<33:43, 1.56s/it]
20%|ββ | 324/1621 [11:11<32:41, 1.51s/it]
20%|ββ | 325/1621 [11:13<32:05, 1.49s/it]
20%|ββ | 326/1621 [11:23<1:32:18, 4.28s/it]
20%|ββ | 327/1621 [11:25<1:15:29, 3.50s/it]
20%|ββ | 328/1621 [11:27<1:02:02, 2.88s/it]
20%|ββ | 329/1621 [11:28<53:10, 2.47s/it]
20%|ββ | 330/1621 [11:30<46:38, 2.17s/it]
20%|ββ | 330/1621 [11:30<46:38, 2.17s/it]
20%|ββ | 331/1621 [11:31<41:45, 1.94s/it]
20%|ββ | 332/1621 [11:33<39:29, 1.84s/it]
21%|ββ | 333/1621 [11:34<36:48, 1.71s/it]
21%|ββ | 334/1621 [11:36<35:24, 1.65s/it]
21%|ββ | 335/1621 [11:37<34:03, 1.59s/it]
21%|ββ | 336/1621 |
|
|
0: {'loss': 0.268, 'grad_norm': 0.3251493921539059, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.21} |
|
|
0: {'loss': 0.2727, 'grad_norm': 0.3288623888122016, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.22} |
|
|
0: [11:39<34:18, 1.60s/it]
21%|ββ | 337/1621 [11:40<33:44, 1.58s/it]
21%|ββ | 338/1621 [11:42<32:55, 1.54s/it]
21%|ββ | 339/1621 [11:43<32:29, 1.52s/it]
21%|ββ | 340/1621 [11:45<32:21, 1.52s/it]
21%|ββ | 340/1621 [11:45<32:21, 1.52s/it]
21%|ββ | 341/1621 [11:46<33:41, 1.58s/it]
21%|ββ | 342/1621 [11:48<33:55, 1.59s/it]
21%|ββ | 343/1621 [11:49<32:46, 1.54s/it]
21%|ββ | 344/1621 [11:51<32:27, 1.53s/it]
21%|βββ | 345/1621 [11:52<31:48, 1.50s/it]
21%|βββ | 346/1621 [11:54<31:13, 1.47s/it]
21%|βββ | 347/1621 [11:55<31:07, 1.47s/it]
21%|βββ | 348/1621 [11:57<30:46, 1.45s/it]
22%|βββ | 349/1621 [11:58<30:38, 1.45s/it]
22%|βββ | 350/1621 [11:59<30:35, 1.44s/it]
22%|βββ | 350/1621 [11:59<30:35, 1.44s/it]
|
|
|
0: {'loss': 0.2751, 'grad_norm': 0.3024279886733832, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.22} |
|
|
0: 22%|βββ | 351/1621 [12:01<30:23, 1.44s/it]
22%|βββ | 352/1621 [12:02<30:25, 1.44s/it]
22%|βββ | 353/1621 [12:04<30:17, 1.43s/it]
22%|βββ | 354/1621 [12:05<30:10, 1.43s/it]
22%|βββ | 355/1621 [12:07<30:06, 1.43s/it]
22%|βββ | 356/1621 [12:08<30:00, 1.42s/it]
22%|βββ | 357/1621 [12:09<30:09, 1.43s/it]
22%|βββ | 358/1621 [12:11<30:13, 1.44s/it]
22%|βββ | 359/1621 [12:12<30:31, 1.45s/it]
22%|βββ | 360/1621 [12:14<30:22, 1.45s/it]
22%|βββ | 360/1621 [12:14<30:22, 1.45s/it]
22%|βββ | 361/1621 [12:15<30:19, 1.44s/it]
22%|βββ | 362/1621 [12:17<30:30, 1.45s/it]
22%|βββ | 363/1621 [12:18<30:11, 1.44s/it]
22%|βββ | 364/1621 [12:19<30:05, 1.44s/it]
23%|βββ | 365/1621 [12:21<30:27, 1.45s/it]
23%|βββ | 366/1621 [12:22<30:21, 1.45s/it]
23 |
|
|
0: {'loss': 0.2709, 'grad_norm': 0.32115118197616366, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.23} |
|
|
0: {'loss': 0.2705, 'grad_norm': 0.3241814781093123, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.23} |
|
|
0: %|βββ | 367/1621 [12:24<30:09, 1.44s/it]
23%|βββ | 368/1621 [12:25<30:15, 1.45s/it]
23%|βββ | 369/1621 [12:27<30:28, 1.46s/it]
23%|βββ | 370/1621 [12:28<30:24, 1.46s/it]
23%|βββ | 370/1621 [12:28<30:24, 1.46s/it]
23%|βββ | 371/1621 [12:30<30:14, 1.45s/it]
23%|βββ | 372/1621 [12:31<29:52, 1.44s/it]
23%|βββ | 373/1621 [12:33<29:41, 1.43s/it]
23%|βββ | 374/1621 [12:34<29:43, 1.43s/it]
23%|βββ | 375/1621 [12:36<31:24, 1.51s/it]
23%|βββ | 376/1621 [12:37<30:43, 1.48s/it]
23%|βββ | 377/1621 [12:39<30:34, 1.47s/it]
23%|βββ | 378/1621 [12:40<30:17, 1.46s/it]
23%|βββ | 379/1621 [12:41<29:55, 1.45s/it]
23%|βββ | 380/1621 [12:43<29:45, 1.44s/it]
23%|βββ | 380/1621 [12:43<29:45, 1.44s/it]
24%|ββ |
|
|
0: {'loss': 0.2701, 'grad_norm': 0.3202907610900123, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.24} |
|
|
0: β | 381/1621 [12:44<29:42, 1.44s/it]
24%|βββ | 382/1621 [12:46<32:44, 1.59s/it]
24%|βββ | 383/1621 [12:48<31:35, 1.53s/it]
24%|βββ | 384/1621 [12:49<30:44, 1.49s/it]
24%|βββ | 385/1621 [12:50<30:18, 1.47s/it]
24%|βββ | 386/1621 [12:52<30:02, 1.46s/it]
24%|βββ | 387/1621 [12:53<30:11, 1.47s/it]
24%|βββ | 388/1621 [12:55<30:15, 1.47s/it]
24%|βββ | 389/1621 [12:56<31:05, 1.51s/it]
24%|βββ | 390/1621 [12:58<30:29, 1.49s/it]
24%|βββ | 390/1621 [12:58<30:29, 1.49s/it]
24%|βββ | 391/1621 [12:59<30:03, 1.47s/it]
24%|βββ | 392/1621 [13:01<29:45, 1.45s/it]
24%|βββ | 393/1621 [13:02<29:36, 1.45s/it]
24%|βββ | 394/1621 [13:04<31:00, 1.52s/it]
24%|βββ | 395/1621 [13:05<30:50, 1.51s/it]
24%|βββ | 396/1621 [13:07<30:19, 1.49s/it]
24%|βββ |
|
|
0: {'loss': 0.268, 'grad_norm': 0.34656413820425974, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.25} |
|
|
0: {'loss': 0.2643, 'grad_norm': 0.31693873851656673, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.25} |
|
|
0: | 397/1621 [13:08<29:51, 1.46s/it]
25%|βββ | 398/1621 [13:10<32:15, 1.58s/it]
25%|βββ | 399/1621 [13:11<31:10, 1.53s/it]
25%|βββ | 400/1621 [13:13<30:33, 1.50s/it]
25%|βββ | 400/1621 [13:13<30:33, 1.50s/it]
25%|βββ | 401/1621 [13:14<30:38, 1.51s/it]
25%|βββ | 402/1621 [13:16<30:12, 1.49s/it]
25%|βββ | 403/1621 [13:17<29:42, 1.46s/it]
25%|βββ | 404/1621 [13:19<31:40, 1.56s/it]
25%|βββ | 405/1621 [13:20<30:45, 1.52s/it]
25%|βββ | 406/1621 [13:22<31:22, 1.55s/it]
25%|βββ | 407/1621 [13:23<30:36, 1.51s/it]
25%|βββ | 408/1621 [13:25<29:51, 1.48s/it]
25%|βββ | 409/1621 [13:26<29:27, 1.46s/it]
25%|βββ | 410/1621 [13:28<29:15, 1.45s/it]
25%|βββ | 410/1621 [13:28<29:15, 1.45s/it]
25%|βββ | |
|
|
0: {'loss': 0.2672, 'grad_norm': 0.2971789023650299, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.26} |
|
|
0: 411/1621 [13:29<29:03, 1.44s/it]
25%|βββ | 412/1621 [13:31<30:57, 1.54s/it]
25%|βββ | 413/1621 [13:32<30:23, 1.51s/it]
26%|βββ | 414/1621 [13:34<29:45, 1.48s/it]
26%|βββ | 415/1621 [13:35<29:49, 1.48s/it]
26%|βββ | 416/1621 [13:37<29:25, 1.46s/it]
26%|βββ | 417/1621 [13:38<30:17, 1.51s/it]
26%|βββ | 418/1621 [13:40<30:55, 1.54s/it]
26%|βββ | 419/1621 [13:41<30:36, 1.53s/it]
26%|βββ | 420/1621 [13:43<30:11, 1.51s/it]
26%|βββ | 420/1621 [13:43<30:11, 1.51s/it]
26%|βββ | 421/1621 [13:44<30:35, 1.53s/it]
26%|βββ | 422/1621 [13:46<30:04, 1.50s/it]
26%|βββ | 423/1621 [13:47<29:41, 1.49s/it]
26%|βββ | 424/1621 [13:49<29:19, 1.47s/it]
26%|βββ | 425/1621 [13:50<29:01, 1.46s/it]
26%|βββ | 426/1621 [13:52<29:34, 1.48s/it]
26%|βββ | 42 |
|
|
0: {'loss': 0.2613, 'grad_norm': 0.31251578412801284, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.27} |
|
|
0: {'loss': 0.2693, 'grad_norm': 0.30507923848851765, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.27} |
|
|
0: 7/1621 [13:53<29:49, 1.50s/it]
26%|βββ | 428/1621 [13:55<29:39, 1.49s/it]
26%|βββ | 429/1621 [13:56<29:11, 1.47s/it]
27%|βββ | 430/1621 [13:58<30:02, 1.51s/it]
27%|βββ | 430/1621 [13:58<30:02, 1.51s/it]
27%|βββ | 431/1621 [13:59<30:46, 1.55s/it]
27%|βββ | 432/1621 [14:01<30:03, 1.52s/it]
27%|βββ | 433/1621 [14:02<29:30, 1.49s/it]
27%|βββ | 434/1621 [14:04<31:03, 1.57s/it]
27%|βββ | 435/1621 [14:05<30:09, 1.53s/it]
27%|βββ | 436/1621 [14:07<29:34, 1.50s/it]
27%|βββ | 437/1621 [14:08<29:30, 1.50s/it]
27%|βββ | 438/1621 [14:10<30:06, 1.53s/it]
27%|βββ | 439/1621 [14:11<29:32, 1.50s/it]
27%|βββ | 440/1621 [14:13<28:58, 1.47s/it]
27%|βββ | 440/1621 [14:13<28:58, 1.47s/it]
27%|βββ | 441/1621 [ |
|
|
0: {'loss': 0.2694, 'grad_norm': 0.307964171218113, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.28} |
|
|
0: 14:14<28:33, 1.45s/it]
27%|βββ | 442/1621 [14:16<28:20, 1.44s/it]
27%|βββ | 443/1621 [14:17<28:29, 1.45s/it]
27%|βββ | 444/1621 [14:19<28:25, 1.45s/it]
27%|βββ | 445/1621 [14:20<28:24, 1.45s/it]
28%|βββ | 446/1621 [14:21<28:17, 1.44s/it]
28%|βββ | 447/1621 [14:23<28:02, 1.43s/it]
28%|βββ | 448/1621 [14:24<27:58, 1.43s/it]
28%|βββ | 449/1621 [14:26<27:47, 1.42s/it]
28%|βββ | 450/1621 [14:27<27:45, 1.42s/it]
28%|βββ | 450/1621 [14:27<27:45, 1.42s/it]
28%|βββ | 451/1621 [14:28<27:51, 1.43s/it]
28%|βββ | 452/1621 [14:30<27:51, 1.43s/it]
28%|βββ | 453/1621 [14:31<27:49, 1.43s/it]
28%|βββ | 454/1621 [14:33<28:02, 1.44s/it]
28%|βββ | 455/1621 [14:34<27:51, 1.43s/it]
28%|βββ | 456/1621 [14:36<27:43, 1.43s/it]
28%|βββ | 457/1621 [14: |
|
|
0: {'loss': 0.2718, 'grad_norm': 0.3093931433367332, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.28} |
|
|
0: {'loss': 0.2638, 'grad_norm': 0.3165329083358544, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.29} |
|
|
0: 37<27:44, 1.43s/it]
28%|βββ | 458/1621 [14:39<27:49, 1.44s/it]
28%|βββ | 459/1621 [14:40<27:56, 1.44s/it]
28%|βββ | 460/1621 [14:41<27:56, 1.44s/it]
28%|βββ | 460/1621 [14:41<27:56, 1.44s/it]
28%|βββ | 461/1621 [14:43<27:49, 1.44s/it]
29%|βββ | 462/1621 [14:44<27:38, 1.43s/it]
29%|βββ | 463/1621 [14:46<27:29, 1.42s/it]
29%|βββ | 464/1621 [14:47<29:02, 1.51s/it]
29%|βββ | 465/1621 [14:49<28:58, 1.50s/it]
29%|βββ | 466/1621 [14:50<28:34, 1.48s/it]
29%|βββ | 467/1621 [14:52<29:23, 1.53s/it]
29%|βββ | 468/1621 [14:53<28:45, 1.50s/it]
29%|βββ | 469/1621 [14:55<28:16, 1.47s/it]
29%|βββ | 470/1621 [14:56<27:54, 1.45s/it]
29%|βββ | 470/1621 [14:56<27:54, 1.45s/it]
29%|βββ | 471/1621 [14:58<29:34 |
|
|
0: {'loss': 0.2692, 'grad_norm': 0.3101656418893075, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.3} |
|
|
0: , 1.54s/it]
29%|βββ | 472/1621 [14:59<28:51, 1.51s/it]
29%|βββ | 473/1621 [15:01<28:21, 1.48s/it]
29%|βββ | 474/1621 [15:02<28:53, 1.51s/it]
29%|βββ | 475/1621 [15:04<28:27, 1.49s/it]
29%|βββ | 476/1621 [15:05<27:59, 1.47s/it]
29%|βββ | 477/1621 [15:07<27:47, 1.46s/it]
29%|βββ | 478/1621 [15:08<27:28, 1.44s/it]
30%|βββ | 479/1621 [15:10<27:27, 1.44s/it]
30%|βββ | 480/1621 [15:11<27:21, 1.44s/it]
30%|βββ | 480/1621 [15:11<27:21, 1.44s/it]
30%|βββ | 481/1621 [15:13<28:29, 1.50s/it]
30%|βββ | 482/1621 [15:14<27:58, 1.47s/it]
30%|βββ | 483/1621 [15:15<27:39, 1.46s/it]
30%|βββ | 484/1621 [15:17<27:37, 1.46s/it]
30%|βββ | 485/1621 [15:18<27:22, 1.45s/it]
30%|βββ | 486/1621 [15:20<27:17, 1.44s/it]
30%|βββ | 487/1621 [15:21<27:06, |
|
|
0: {'loss': 0.2662, 'grad_norm': 0.30695490828095756, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.3} |
|
|
0: {'loss': 0.2704, 'grad_norm': 0.32199906361703134, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.31} |
|
|
0: 1.43s/it]
30%|βββ | 488/1621 [15:23<27:11, 1.44s/it]
30%|βββ | 489/1621 [15:24<28:34, 1.51s/it]
30%|βββ | 490/1621 [15:26<28:01, 1.49s/it]
30%|βββ | 490/1621 [15:26<28:01, 1.49s/it]
30%|βββ | 491/1621 [15:27<29:22, 1.56s/it]
30%|βββ | 492/1621 [15:29<28:32, 1.52s/it]
30%|βββ | 493/1621 [15:30<27:55, 1.49s/it]
30%|βββ | 494/1621 [15:32<27:45, 1.48s/it]
31%|βββ | 495/1621 [15:33<27:28, 1.46s/it]
31%|βββ | 496/1621 [15:35<27:28, 1.47s/it]
31%|βββ | 497/1621 [15:36<27:33, 1.47s/it]
31%|βββ | 498/1621 [15:38<27:06, 1.45s/it]
31%|βββ | 499/1621 [15:39<27:05, 1.45s/it]
31%|βββ | 500/1621 [15:40<26:51, 1.44s/it]
31%|βββ | 500/1621 [15:40<26:51, 1.44s/it]
31%|βββ | 501/1621 [15:42<26:54, 1.44s/it |
|
|
0: {'loss': 0.2698, 'grad_norm': 0.3168722994354358, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.31} |
|
|
0: ]
31%|βββ | 502/1621 [15:43<27:53, 1.50s/it]
31%|βββ | 503/1621 [15:45<27:32, 1.48s/it]
31%|βββ | 504/1621 [15:46<27:12, 1.46s/it]
31%|βββ | 505/1621 [15:48<26:55, 1.45s/it]
31%|βββ | 506/1621 [15:49<26:47, 1.44s/it]
31%|ββββ | 507/1621 [15:51<26:32, 1.43s/it]
31%|ββββ | 508/1621 [15:52<26:23, 1.42s/it]
31%|ββββ | 509/1621 [15:54<27:44, 1.50s/it]
31%|ββββ | 510/1621 [15:55<27:21, 1.48s/it]
31%|ββββ | 510/1621 [15:55<27:21, 1.48s/it]
32%|ββββ | 511/1621 [15:57<27:29, 1.49s/it]
32%|ββββ | 512/1621 [15:58<27:02, 1.46s/it]
32%|ββββ | 513/1621 [15:59<26:44, 1.45s/it]
32%|ββββ | 514/1621 [16:01<26:32, 1.44s/it]
32%|ββββ | 515/1621 [16:02<26:21, 1.43s/it]
32%|ββββ | 516/1621 [16:04<27:26, 1.49s/it]
32%|ββββ | 517/1621 [1 |
|
|
0: {'loss': 0.2724, 'grad_norm': 0.3419596649304518, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.32} |
|
|
0: {'loss': 0.2691, 'grad_norm': 0.3214826713689335, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.33} |
|
|
0: 6:05<26:58, 1.47s/it]
32%|ββββ | 518/1621 [16:07<26:37, 1.45s/it]
32%|ββββ | 519/1621 [16:08<26:22, 1.44s/it]
32%|ββββ | 520/1621 [16:09<26:13, 1.43s/it]
32%|ββββ | 520/1621 [16:09<26:13, 1.43s/it]
32%|ββββ | 521/1621 [16:11<26:21, 1.44s/it]
32%|ββββ | 522/1621 [16:12<26:11, 1.43s/it]
32%|ββββ | 523/1621 [16:14<26:26, 1.44s/it]
32%|ββββ | 524/1621 [16:15<26:27, 1.45s/it]
32%|ββββ | 525/1621 [16:17<26:41, 1.46s/it]
32%|ββββ | 526/1621 [16:18<27:06, 1.49s/it]
33%|ββββ | 527/1621 [16:20<28:06, 1.54s/it]
33%|ββββ | 528/1621 [16:22<28:58, 1.59s/it]
33%|ββββ | 529/1621 [16:23<28:03, 1.54s/it]
33%|ββββ | 530/1621 [16:25<27:21, 1.50s/it]
33%|ββββ | 530/1621 [16:25<27:21, 1.50s/it]
33%|ββοΏ½ |
|
|
0: {'loss': 0.2723, 'grad_norm': 0.31076689228472376, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.33} |
|
|
0: οΏ½οΏ½β | 531/1621 [16:26<27:16, 1.50s/it]
33%|ββββ | 532/1621 [16:28<27:28, 1.51s/it]
33%|ββββ | 533/1621 [16:29<26:53, 1.48s/it]
33%|ββββ | 534/1621 [16:31<28:29, 1.57s/it]
33%|ββββ | 535/1621 [16:32<27:34, 1.52s/it]
33%|ββββ | 536/1621 [16:34<28:02, 1.55s/it]
33%|ββββ | 537/1621 [16:35<27:27, 1.52s/it]
33%|ββββ | 538/1621 [16:37<27:09, 1.50s/it]
33%|ββββ | 539/1621 [16:38<27:26, 1.52s/it]
33%|ββββ | 540/1621 [16:40<27:29, 1.53s/it]
33%|ββββ | 540/1621 [16:40<27:29, 1.53s/it]
33%|ββββ | 541/1621 [16:41<27:11, 1.51s/it]
33%|ββββ | 542/1621 [16:43<26:35, 1.48s/it]
33%|ββββ | 543/1621 [16:44<26:11, 1.46s/it]
34%|ββββ | 544/1621 [16:46<25:53, 1.44s/it]
34%|ββββ | 545/1621 [16:47<25:43, 1.43s/it]
34%|ββββ | 546/1621 [16:48 |
|
|
0: {'loss': 0.267, 'grad_norm': 0.3096662553218259, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.34} |
|
|
0: {'loss': 0.2652, 'grad_norm': 0.3305774329274305, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.35} |
|
|
0: <25:33, 1.43s/it]
34%|ββββ | 547/1621 [16:50<25:40, 1.43s/it]
34%|ββββ | 548/1621 [16:51<25:37, 1.43s/it]
34%|ββββ | 549/1621 [16:53<25:57, 1.45s/it]
34%|ββββ | 550/1621 [16:54<26:54, 1.51s/it]
34%|ββββ | 550/1621 [16:54<26:54, 1.51s/it]
34%|ββββ | 551/1621 [16:56<26:36, 1.49s/it]
34%|ββββ | 552/1621 [16:57<26:14, 1.47s/it]
34%|ββββ | 553/1621 [16:59<25:54, 1.46s/it]
34%|ββββ | 554/1621 [17:00<25:42, 1.45s/it]
34%|ββββ | 555/1621 [17:02<27:30, 1.55s/it]
34%|ββββ | 556/1621 [17:03<27:11, 1.53s/it]
34%|ββββ | 557/1621 [17:05<26:39, 1.50s/it]
34%|ββββ | 558/1621 [17:06<26:10, 1.48s/it]
34%|ββββ | 559/1621 [17:08<25:48, 1.46s/it]
35%|ββββ | 560/1621 [17:09<26:29, 1.50s/it]
35%|βββοΏ½ |
|
|
0: {'loss': 0.27, 'grad_norm': 0.3183396295658571, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.35} |
|
|
0: οΏ½ | 560/1621 [17:09<26:29, 1.50s/it]
35%|ββββ | 561/1621 [17:11<26:17, 1.49s/it]
35%|ββββ | 562/1621 [17:12<25:50, 1.46s/it]
35%|ββββ | 563/1621 [17:14<25:33, 1.45s/it]
35%|ββββ | 564/1621 [17:15<25:34, 1.45s/it]
35%|ββββ | 565/1621 [17:16<25:23, 1.44s/it]
35%|ββββ | 566/1621 [17:18<25:08, 1.43s/it]
35%|ββββ | 567/1621 [17:19<24:59, 1.42s/it]
35%|ββββ | 568/1621 [17:21<24:56, 1.42s/it]
35%|ββββ | 569/1621 [17:22<26:16, 1.50s/it]
35%|ββββ | 570/1621 [17:24<26:08, 1.49s/it]
35%|ββββ | 570/1621 [17:24<26:08, 1.49s/it]
35%|ββββ | 571/1621 [17:25<26:08, 1.49s/it]
35%|ββββ | 572/1621 [17:27<25:39, 1.47s/it]
35%|ββββ | 573/1621 [17:28<26:19, 1.51s/it]
35%|ββββ | 574/1621 [17:30<26:44, 1.53s/it]
35%|ββββ | 575/1621 [17:31<26: |
|
|
0: {'loss': 0.263, 'grad_norm': 0.3045611958885712, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.36} |
|
|
0: 20, 1.51s/it]
36%|ββββ | 576/1621 [17:33<26:22, 1.51s/it]
36%|ββββ | 577/1621 [17:34<27:06, 1.56s/it]
36%|ββββ | 578/1621 [17:36<27:20, 1.57s/it]
36%|ββββ | 579/1621 [17:38<26:35, 1.53s/it]
36%|ββββ | 580/1621 [17:39<26:03, 1.50s/it]
36%|ββββ | 580/1621 [17:39<26:03, 1.50s/it]
36%|ββββ | 581/1621 [17:40<25:33, 1.47s/it]
36%|ββββ | 582/1621 [17:42<25:19, 1.46s/it]
36%|ββββ | 583/1621 [17:43<25:56, 1.50s/it]
36%|ββββ | 584/1621 [17:45<25:47, 1.49s/it]
36%|ββββ | 585/1621 [17:46<25:23, 1.47s/it]
36%|ββββ | 586/1621 [17:48<25:18, 1.47s/it]
36%|ββββ | 587/1621 [17:50<26:48, 1.56s/it]
36%|ββββ | 588/1621 [17:51<25:59, 1.51s/it]
36%|ββββ | 589/1621 [17:52<25:32, 1.48s/it]
36%|ββββ | 590/1621 [17:54<25:09, 1.46s/it]
|
|
|
0: {'loss': 0.2654, 'grad_norm': 0.30739048454915474, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.36} |
|
|
0: {'loss': 0.2681, 'grad_norm': 0.3042454032840815, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.37} |
|
|
0:
36%|ββββ | 590/1621 [17:54<25:09, 1.46s/it]
36%|ββββ | 591/1621 [17:55<24:57, 1.45s/it]
37%|ββββ | 592/1621 [17:57<24:49, 1.45s/it]
37%|ββββ | 593/1621 [17:58<24:36, 1.44s/it]
37%|ββββ | 594/1621 [17:59<24:39, 1.44s/it]
37%|ββββ | 595/1621 [18:01<24:28, 1.43s/it]
37%|ββββ | 596/1621 [18:02<24:23, 1.43s/it]
37%|ββββ | 597/1621 [18:04<24:33, 1.44s/it]
37%|ββββ | 598/1621 [18:05<24:30, 1.44s/it]
37%|ββββ | 599/1621 [18:07<24:27, 1.44s/it]
37%|ββββ | 600/1621 [18:08<24:21, 1.43s/it]
37%|ββββ | 600/1621 [18:08<24:21, 1.43s/it]
37%|ββββ | 601/1621 [18:10<24:19, 1.43s/it]
37%|ββββ | 602/1621 [18:11<24:54, 1.47s/it]
37%|ββββ | 603/1621 [18:12<24:45, 1.46s/it]
37%|ββββ | 604/1621 [18:14<24:34, |
|
|
0: {'loss': 0.2616, 'grad_norm': 0.30618895900501564, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.38} |
|
|
0: 1.45s/it]
37%|ββββ | 605/1621 [18:15<24:26, 1.44s/it]
37%|ββββ | 606/1621 [18:17<24:15, 1.43s/it]
37%|ββββ | 607/1621 [18:18<24:25, 1.45s/it]
38%|ββββ | 608/1621 [18:20<24:17, 1.44s/it]
38%|ββββ | 609/1621 [18:21<25:29, 1.51s/it]
38%|ββββ | 610/1621 [18:23<25:15, 1.50s/it]
38%|ββββ | 610/1621 [18:23<25:15, 1.50s/it]
38%|ββββ | 611/1621 [18:24<25:04, 1.49s/it]
38%|ββββ | 612/1621 [18:26<25:54, 1.54s/it]
38%|ββββ | 613/1621 [18:27<25:13, 1.50s/it]
38%|ββββ | 614/1621 [18:29<24:53, 1.48s/it]
38%|ββββ | 615/1621 [18:30<24:36, 1.47s/it]
38%|ββββ | 616/1621 [18:32<24:54, 1.49s/it]
38%|ββββ | 617/1621 [18:33<24:41, 1.48s/it]
38%|ββββ | 618/1621 [18:35<24:55, 1.49s/it]
38%|ββββ | 619/1621 [18:36<24:38, 1.48s/it]
38%|ββββ |
|
|
0: {'loss': 0.2613, 'grad_norm': 0.2919650844350329, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.38} |
|
|
0: {'loss': 0.2675, 'grad_norm': 0.3040205468853955, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.39} |
|
|
0: | 620/1621 [18:38<24:27, 1.47s/it]
38%|ββββ | 620/1621 [18:38<24:27, 1.47s/it]
38%|ββββ | 621/1621 [18:39<24:48, 1.49s/it]
38%|ββββ | 622/1621 [18:41<24:25, 1.47s/it]
38%|ββββ | 623/1621 [18:42<24:41, 1.48s/it]
38%|ββββ | 624/1621 [18:44<24:43, 1.49s/it]
39%|ββββ | 625/1621 [18:45<24:14, 1.46s/it]
39%|ββββ | 626/1621 [18:46<24:02, 1.45s/it]
39%|ββββ | 627/1621 [18:48<23:50, 1.44s/it]
39%|ββββ | 628/1621 [18:49<24:02, 1.45s/it]
39%|ββββ | 629/1621 [18:51<25:00, 1.51s/it]
39%|ββββ | 630/1621 [18:52<24:30, 1.48s/it]
39%|ββββ | 630/1621 [18:52<24:30, 1.48s/it]
39%|ββββ | 631/1621 [18:54<24:21, 1.48s/it]
39%|ββββ | 632/1621 [18:55<24:03, 1.46s/it]
39%|ββββ | 633/1621 [18:57<24:11, 1.4 |
|
|
0: {'loss': 0.2699, 'grad_norm': 0.31062647915000946, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.39} |
|
|
0: 7s/it]
39%|ββββ | 634/1621 [18:58<25:13, 1.53s/it]
39%|ββββ | 635/1621 [19:00<24:52, 1.51s/it]
39%|ββββ | 636/1621 [19:01<24:20, 1.48s/it]
39%|ββββ | 637/1621 [19:03<24:52, 1.52s/it]
39%|ββββ | 638/1621 [19:04<24:19, 1.48s/it]
39%|ββββ | 639/1621 [19:06<23:58, 1.47s/it]
39%|ββββ | 640/1621 [19:07<23:49, 1.46s/it]
39%|ββββ | 640/1621 [19:07<23:49, 1.46s/it]
40%|ββββ | 641/1621 [19:09<24:03, 1.47s/it]
40%|ββββ | 642/1621 [19:10<23:51, 1.46s/it]
40%|ββββ | 643/1621 [19:12<23:36, 1.45s/it]
40%|ββββ | 644/1621 [19:13<23:29, 1.44s/it]
40%|ββββ | 645/1621 [19:14<23:22, 1.44s/it]
40%|ββββ | 646/1621 [19:16<23:21, 1.44s/it]
40%|ββββ | 647/1621 [19:17<24:05, 1.48s/it]
40%|ββββ | 648/1621 [19:19<24:15, 1.50s/it]
40%|ββββ |
|
|
0: {'loss': 0.2618, 'grad_norm': 0.3166654431631163, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.4} |
|
|
0: [2025-09-02 19:07:12,499] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-650 |
|
|
0: [2025-09-02 19:07:17,316] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None` |
|
|
0: {'loss': 0.2676, 'grad_norm': 0.3157907525928946, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.41} |
|
|
0: | 649/1621 [19:20<23:51, 1.47s/it]
40%|ββββ | 650/1621 [19:22<23:59, 1.48s/it]
40%|ββββ | 650/1621 [19:22<23:59, 1.48s/it]
40%|ββββ | 651/1621 [19:32<1:07:48, 4.19s/it]
40%|ββββ | 652/1621 [19:34<54:15, 3.36s/it]
40%|ββββ | 653/1621 [19:35<45:04, 2.79s/it]
40%|ββββ | 654/1621 [19:37<38:26, 2.39s/it]
40%|ββββ | 655/1621 [19:38<33:59, 2.11s/it]
40%|ββββ | 656/1621 [19:40<30:43, 1.91s/it]
41%|ββββ | 657/1621 [19:41<29:56, 1.86s/it]
41%|ββββ | 658/1621 [19:43<27:43, 1.73s/it]
41%|ββββ | 659/1621 [19:44<26:15, 1.64s/it]
41%|ββββ | 660/1621 [19:46<25:07, 1.57s/it]
41%|ββββ | 660/1621 [19:46<25:07, 1.57s/it]
41%|ββββ | 661/1621 [19:47<24:20, 1.52s/it]
41%|ββββ | 662/1621 [19:49<24:15, 1.5 |
|
|
0: {'loss': 0.2635, 'grad_norm': 0.30820407823108603, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.41} |
|
|
0: 2s/it]
41%|ββββ | 663/1621 [19:50<23:57, 1.50s/it]
41%|ββββ | 664/1621 [19:51<23:39, 1.48s/it]
41%|ββββ | 665/1621 [19:53<24:12, 1.52s/it]
41%|ββββ | 666/1621 [19:55<23:55, 1.50s/it]
41%|ββββ | 667/1621 [19:56<25:02, 1.57s/it]
41%|ββββ | 668/1621 [19:58<25:12, 1.59s/it]
41%|βββββ | 669/1621 [19:59<24:22, 1.54s/it]
41%|βββββ | 670/1621 [20:01<23:43, 1.50s/it]
41%|βββββ | 670/1621 [20:01<23:43, 1.50s/it]
41%|βββββ | 671/1621 [20:02<23:20, 1.47s/it]
41%|βββββ | 672/1621 [20:04<23:01, 1.46s/it]
42%|βββββ | 673/1621 [20:05<22:55, 1.45s/it]
42%|βββββ | 674/1621 [20:06<22:44, 1.44s/it]
42%|βββββ | 675/1621 [20:08<22:36, 1.43s/it]
42%|βββββ | 676/1621 [20:09<22:27, 1.43s/it]
42%|βββββ | 677/1621 [20:11<23:34, 1.50s/it]
|
|
|
0: {'loss': 0.2711, 'grad_norm': 0.3083107428928576, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.42} |
|
|
0: {'loss': 0.2663, 'grad_norm': 0.3239813055521283, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.43} |
|
|
0: 42%|βββββ | 678/1621 [20:13<24:26, 1.56s/it]
42%|βββββ | 679/1621 [20:14<23:53, 1.52s/it]
42%|βββββ | 680/1621 [20:15<23:21, 1.49s/it]
42%|βββββ | 680/1621 [20:15<23:21, 1.49s/it]
42%|βββββ | 681/1621 [20:17<24:32, 1.57s/it]
42%|βββββ | 682/1621 [20:19<24:05, 1.54s/it]
42%|βββββ | 683/1621 [20:20<23:51, 1.53s/it]
42%|βββββ | 684/1621 [20:22<23:35, 1.51s/it]
42%|βββββ | 685/1621 [20:23<23:05, 1.48s/it]
42%|βββββ | 686/1621 [20:24<22:43, 1.46s/it]
42%|βββββ | 687/1621 [20:26<22:29, 1.44s/it]
42%|βββββ | 688/1621 [20:27<22:28, 1.44s/it]
43%|βββββ | 689/1621 [20:29<22:17, 1.44s/it]
43%|βββββ | 690/1621 [20:30<22:19, 1.44s/it]
43%|βββββ | 690/1621 [20:30<22:19, 1.44s/it]
43%|οΏ½ |
|
|
0: {'loss': 0.2664, 'grad_norm': 0.3048360867705897, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.43} |
|
|
0: οΏ½οΏ½ββββ | 691/1621 [20:32<22:08, 1.43s/it]
43%|βββββ | 692/1621 [20:33<21:59, 1.42s/it]
43%|βββββ | 693/1621 [20:34<21:54, 1.42s/it]
43%|βββββ | 694/1621 [20:36<21:49, 1.41s/it]
43%|βββββ | 695/1621 [20:37<21:46, 1.41s/it]
43%|βββββ | 696/1621 [20:39<22:08, 1.44s/it]
43%|βββββ | 697/1621 [20:40<22:00, 1.43s/it]
43%|βββββ | 698/1621 [20:41<21:51, 1.42s/it]
43%|βββββ | 699/1621 [20:43<21:43, 1.41s/it]
43%|βββββ | 700/1621 [20:44<21:44, 1.42s/it]
43%|βββββ | 700/1621 [20:44<21:44, 1.42s/it]
43%|βββββ | 701/1621 [20:46<21:50, 1.42s/it]
43%|βββββ | 702/1621 [20:47<21:53, 1.43s/it]
43%|βββββ | 703/1621 [20:49<21:51, 1.43s/it]
43%|βββββ | 704/1621 [20:50<21:44, 1.42s/it]
43%|βββββ | 705/1621 [20:51<21:42, 1.42s/it]
4 |
|
|
0: {'loss': 0.264, 'grad_norm': 0.2919821903782183, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.44} |
|
|
0: 4%|βββββ | 706/1621 [20:53<22:05, 1.45s/it]
44%|βββββ | 707/1621 [20:55<23:19, 1.53s/it]
44%|βββββ | 708/1621 [20:56<22:46, 1.50s/it]
44%|βββββ | 709/1621 [20:58<22:52, 1.50s/it]
44%|βββββ | 710/1621 [20:59<22:28, 1.48s/it]
44%|βββββ | 710/1621 [20:59<22:28, 1.48s/it]
44%|βββββ | 711/1621 [21:01<23:08, 1.53s/it]
44%|βββββ | 712/1621 [21:02<22:43, 1.50s/it]
44%|βββββ | 713/1621 [21:04<22:17, 1.47s/it]
44%|βββββ | 714/1621 [21:05<21:58, 1.45s/it]
44%|βββββ | 715/1621 [21:06<22:07, 1.46s/it]
44%|βββββ | 716/1621 [21:08<22:28, 1.49s/it]
44%|βββββ | 717/1621 [21:09<22:01, 1.46s/it]
44%|βββββ | 718/1621 [21:11<21:58, 1.46s/it]
44%|βββββ | 719/1621 [21:12<21:50, 1.45s/it]
44%|βββββ | 720/1621 [21:14<21:39, 1.44s/it |
|
|
0: {'loss': 0.2624, 'grad_norm': 0.30328239047498634, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.44} |
|
|
0: {'loss': 0.2642, 'grad_norm': 0.29982742337438895, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.45} |
|
|
0: ]
44%|βββββ | 720/1621 [21:14<21:39, 1.44s/it]
44%|βββββ | 721/1621 [21:15<21:48, 1.45s/it]
45%|βββββ | 722/1621 [21:17<21:43, 1.45s/it]
45%|βββββ | 723/1621 [21:18<22:12, 1.48s/it]
45%|βββββ | 724/1621 [21:20<21:53, 1.46s/it]
45%|βββββ | 725/1621 [21:21<21:42, 1.45s/it]
45%|βββββ | 726/1621 [21:22<21:32, 1.44s/it]
45%|βββββ | 727/1621 [21:24<21:26, 1.44s/it]
45%|βββββ | 728/1621 [21:25<21:19, 1.43s/it]
45%|βββββ | 729/1621 [21:27<21:27, 1.44s/it]
45%|βββββ | 730/1621 [21:28<21:19, 1.44s/it]
45%|βββββ | 730/1621 [21:28<21:19, 1.44s/it]
45%|βββββ | 731/1621 [21:30<21:17, 1.44s/it]
45%|βββββ | 732/1621 [21:31<21:09, 1.43s/it]
45%|βββββ | 733/1621 [21:33<21:36, 1.46s/it]
45 |
|
|
0: {'loss': 0.2673, 'grad_norm': 0.30890762426068, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.46} |
|
|
0: %|βββββ | 734/1621 [21:34<22:02, 1.49s/it]
45%|βββββ | 735/1621 [21:36<22:31, 1.53s/it]
45%|βββββ | 736/1621 [21:37<22:03, 1.50s/it]
45%|βββββ | 737/1621 [21:39<21:43, 1.47s/it]
46%|βββββ | 738/1621 [21:40<21:43, 1.48s/it]
46%|βββββ | 739/1621 [21:42<22:17, 1.52s/it]
46%|βββββ | 740/1621 [21:43<21:51, 1.49s/it]
46%|βββββ | 740/1621 [21:43<21:51, 1.49s/it]
46%|βββββ | 741/1621 [21:45<21:39, 1.48s/it]
46%|βββββ | 742/1621 [21:46<21:27, 1.46s/it]
46%|βββββ | 743/1621 [21:47<21:11, 1.45s/it]
46%|βββββ | 744/1621 [21:49<20:59, 1.44s/it]
46%|βββββ | 745/1621 [21:50<20:49, 1.43s/it]
46%|βββββ | 746/1621 [21:52<20:45, 1.42s/it]
46%|βββββ | 747/1621 [21:53<20:59, 1.44s/it]
46%|βββββ | 748/1621 [21:55<20:57, 1.44s/it] |
|
|
0: {'loss': 0.2591, 'grad_norm': 0.30747963031394887, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.46} |
|
|
0: {'loss': 0.2642, 'grad_norm': 0.30645470994710144, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.47} |
|
|
0:
46%|βββββ | 749/1621 [21:56<20:53, 1.44s/it]
46%|βββββ | 750/1621 [21:57<20:47, 1.43s/it]
46%|βββββ | 750/1621 [21:57<20:47, 1.43s/it]
46%|βββββ | 751/1621 [21:59<21:09, 1.46s/it]
46%|βββββ | 752/1621 [22:00<21:08, 1.46s/it]
46%|βββββ | 753/1621 [22:02<20:57, 1.45s/it]
47%|βββββ | 754/1621 [22:03<20:46, 1.44s/it]
47%|βββββ | 755/1621 [22:05<20:42, 1.43s/it]
47%|βββββ | 756/1621 [22:06<20:59, 1.46s/it]
47%|βββββ | 757/1621 [22:08<20:53, 1.45s/it]
47%|βββββ | 758/1621 [22:09<21:05, 1.47s/it]
47%|βββββ | 759/1621 [22:10<20:54, 1.45s/it]
47%|βββββ | 760/1621 [22:12<20:42, 1.44s/it]
47%|βββββ | 760/1621 [22:12<20:42, 1.44s/it]
47%|βββββ | 761/1621 [22:13<20:34, 1.44s/it]
47% |
|
|
0: {'loss': 0.2636, 'grad_norm': 0.3113139232378856, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.48} |
|
|
0: |βββββ | 762/1621 [22:15<21:09, 1.48s/it]
47%|βββββ | 763/1621 [22:16<20:51, 1.46s/it]
47%|βββββ | 764/1621 [22:18<21:57, 1.54s/it]
47%|βββββ | 765/1621 [22:19<21:21, 1.50s/it]
47%|βββββ | 766/1621 [22:21<21:06, 1.48s/it]
47%|βββββ | 767/1621 [22:22<20:53, 1.47s/it]
47%|βββββ | 768/1621 [22:24<20:52, 1.47s/it]
47%|βββββ | 769/1621 [22:25<20:38, 1.45s/it]
48%|βββββ | 770/1621 [22:27<20:30, 1.45s/it]
48%|βββββ | 770/1621 [22:27<20:30, 1.45s/it]
48%|βββββ | 771/1621 [22:28<20:23, 1.44s/it]
48%|βββββ | 772/1621 [22:29<20:21, 1.44s/it]
48%|βββββ | 773/1621 [22:31<20:30, 1.45s/it]
48%|βββββ | 774/1621 [22:32<20:20, 1.44s/it]
48%|βββββ | 775/1621 [22:34<20:38, 1.46s/it]
48%|βββββ | 776/1621 [22:35<20:32, 1.46s/it]
|
|
|
0: {'loss': 0.2548, 'grad_norm': 0.29543505645103424, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.48} |
|
|
0: {'loss': 0.2602, 'grad_norm': 0.29282257246610377, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.49} |
|
|
0: 48%|βββββ | 777/1621 [22:37<20:23, 1.45s/it]
48%|βββββ | 778/1621 [22:38<21:04, 1.50s/it]
48%|βββββ | 779/1621 [22:40<20:51, 1.49s/it]
48%|βββββ | 780/1621 [22:41<20:45, 1.48s/it]
48%|βββββ | 780/1621 [22:41<20:45, 1.48s/it]
48%|βββββ | 781/1621 [22:43<20:41, 1.48s/it]
48%|βββββ | 782/1621 [22:44<20:28, 1.46s/it]
48%|βββββ | 783/1621 [22:46<20:15, 1.45s/it]
48%|βββββ | 784/1621 [22:47<21:18, 1.53s/it]
48%|βββββ | 785/1621 [22:49<20:52, 1.50s/it]
48%|βββββ | 786/1621 [22:50<20:43, 1.49s/it]
49%|βββββ | 787/1621 [22:52<20:25, 1.47s/it]
49%|βββββ | 788/1621 [22:53<20:27, 1.47s/it]
49%|βββββ | 789/1621 [22:55<20:18, 1.46s/it]
49%|βββββ | 790/1621 [22:56<20:58, 1.51s/it]
49%| |
|
|
0: {'loss': 0.2561, 'grad_norm': 0.30099682495532426, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.49} |
|
|
0: βββββ | 790/1621 [22:56<20:58, 1.51s/it]
49%|βββββ | 791/1621 [22:58<20:33, 1.49s/it]
49%|βββββ | 792/1621 [22:59<21:12, 1.53s/it]
49%|βββββ | 793/1621 [23:01<22:04, 1.60s/it]
49%|βββββ | 794/1621 [23:02<21:14, 1.54s/it]
49%|βββββ | 795/1621 [23:04<21:02, 1.53s/it]
49%|βββββ | 796/1621 [23:05<20:32, 1.49s/it]
49%|βββββ | 797/1621 [23:07<20:20, 1.48s/it]
49%|βββββ | 798/1621 [23:08<20:50, 1.52s/it]
49%|βββββ | 799/1621 [23:10<20:49, 1.52s/it]
49%|βββββ | 800/1621 [23:11<20:41, 1.51s/it]
49%|βββββ | 800/1621 [23:11<20:41, 1.51s/it]
49%|βββββ | 801/1621 [23:13<21:04, 1.54s/it]
49%|βββββ | 802/1621 [23:15<20:48, 1.52s/it]
50%|βββββ | 803/1621 [23:16<20:39, 1.52s/it]
50%|βββββ | 804/1621 [23:17<20:16, 1.49s/it]
|
|
|
0: {'loss': 0.2608, 'grad_norm': 0.28990308697565914, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.5} |
|
|
0: 50%|βββββ | 805/1621 [23:19<19:57, 1.47s/it]
50%|βββββ | 806/1621 [23:20<19:40, 1.45s/it]
50%|βββββ | 807/1621 [23:22<19:49, 1.46s/it]
50%|βββββ | 808/1621 [23:23<19:48, 1.46s/it]
50%|βββββ | 809/1621 [23:25<19:38, 1.45s/it]
50%|βββββ | 810/1621 [23:26<19:42, 1.46s/it]
50%|βββββ | 810/1621 [23:26<19:42, 1.46s/it]
50%|βββββ | 811/1621 [23:28<19:33, 1.45s/it]
50%|βββββ | 812/1621 [23:29<19:29, 1.45s/it]
50%|βββββ | 813/1621 [23:30<19:39, 1.46s/it]
50%|βββββ | 814/1621 [23:32<19:42, 1.47s/it]
50%|βββββ | 815/1621 [23:33<19:50, 1.48s/it]
50%|βββββ | 816/1621 [23:35<19:37, 1.46s/it]
50%|βββββ | 817/1621 [23:36<19:29, 1.45s/it]
50%|βββββ | 818/1621 [23:38<19:23, 1.45s/it]
51%|βββββ | 819/1621 [23:39<20:23, 1.53s/i |
|
|
0: {'loss': 0.2593, 'grad_norm': 0.30472493668896977, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.51} |
|
|
0: {'loss': 0.2672, 'grad_norm': 0.32931684813090084, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.51} |
|
|
0: t]
51%|βββββ | 820/1621 [23:41<19:58, 1.50s/it]
51%|βββββ | 820/1621 [23:41<19:58, 1.50s/it]
51%|βββββ | 821/1621 [23:42<20:06, 1.51s/it]
51%|βββββ | 822/1621 [23:44<19:41, 1.48s/it]
51%|βββββ | 823/1621 [23:45<19:26, 1.46s/it]
51%|βββββ | 824/1621 [23:47<19:14, 1.45s/it]
51%|βββββ | 825/1621 [23:48<19:58, 1.51s/it]
51%|βββββ | 826/1621 [23:50<19:46, 1.49s/it]
51%|βββββ | 827/1621 [23:51<19:27, 1.47s/it]
51%|βββββ | 828/1621 [23:53<19:17, 1.46s/it]
51%|βββββ | 829/1621 [23:54<19:03, 1.44s/it]
51%|βββββ | 830/1621 [23:55<18:52, 1.43s/it]
51%|βββββ | 830/1621 [23:55<18:52, 1.43s/it]
51%|ββββββ | 831/1621 [23:57<18:54, 1.44s/it]
51%|ββββββ | 832/1621 [23:58<18:47, 1.43s/it |
|
|
0: {'loss': 0.2649, 'grad_norm': 0.3394419476608706, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.52} |
|
|
0: ]
51%|ββββββ | 833/1621 [24:00<18:41, 1.42s/it]
51%|ββββββ | 834/1621 [24:01<18:41, 1.43s/it]
52%|ββββββ | 835/1621 [24:03<19:05, 1.46s/it]
52%|ββββββ | 836/1621 [24:04<19:00, 1.45s/it]
52%|ββββββ | 837/1621 [24:06<19:07, 1.46s/it]
52%|ββββββ | 838/1621 [24:07<20:22, 1.56s/it]
52%|ββββββ | 839/1621 [24:09<20:50, 1.60s/it]
52%|ββββββ | 840/1621 [24:11<20:13, 1.55s/it]
52%|ββββββ | 840/1621 [24:11<20:13, 1.55s/it]
52%|ββββββ | 841/1621 [24:12<20:25, 1.57s/it]
52%|ββββββ | 842/1621 [24:14<19:56, 1.54s/it]
52%|ββββββ | 843/1621 [24:15<19:25, 1.50s/it]
52%|ββββββ | 844/1621 [24:16<19:02, 1.47s/it]
52%|ββββββ | 845/1621 [24:18<19:54, 1.54s/it]
52%|ββββββ | 846/1621 [24:20<19:29, 1.51s/it]
52%|ββββββ |
|
|
0: {'loss': 0.2671, 'grad_norm': 0.2952695242398675, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.52} |
|
|
0: {'loss': 0.264, 'grad_norm': 0.29373714171314075, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.53} |
|
|
0: | 847/1621 [24:21<19:08, 1.48s/it]
52%|ββββββ | 848/1621 [24:22<18:48, 1.46s/it]
52%|ββββββ | 849/1621 [24:24<19:33, 1.52s/it]
52%|ββββββ | 850/1621 [24:25<19:12, 1.49s/it]
52%|ββββββ | 850/1621 [24:25<19:12, 1.49s/it]
52%|ββββββ | 851/1621 [24:27<18:59, 1.48s/it]
53%|ββββββ | 852/1621 [24:28<18:47, 1.47s/it]
53%|ββββββ | 853/1621 [24:30<18:50, 1.47s/it]
53%|ββββββ | 854/1621 [24:31<18:51, 1.48s/it]
53%|ββββββ | 855/1621 [24:33<19:00, 1.49s/it]
53%|ββββββ | 856/1621 [24:34<18:42, 1.47s/it]
53%|ββββββ | 857/1621 [24:36<18:31, 1.45s/it]
53%|ββββββ | 858/1621 [24:37<18:22, 1.44s/it]
53%|ββββββ | 859/1621 [24:39<18:13, 1.44s/it]
53%|ββββββ | 860/1621 [24:40<18:07, 1.43s/it]
|
|
|
0: {'loss': 0.2591, 'grad_norm': 0.2933249021776934, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.54} |
|
|
0: 53%|ββββββ | 860/1621 [24:40<18:07, 1.43s/it]
53%|ββββββ | 861/1621 [24:41<18:09, 1.43s/it]
53%|ββββββ | 862/1621 [24:43<18:04, 1.43s/it]
53%|ββββββ | 863/1621 [24:44<18:10, 1.44s/it]
53%|ββββββ | 864/1621 [24:46<18:52, 1.50s/it]
53%|ββββββ | 865/1621 [24:47<18:33, 1.47s/it]
53%|ββββββ | 866/1621 [24:49<18:23, 1.46s/it]
53%|ββββββ | 867/1621 [24:50<18:17, 1.46s/it]
54%|ββββββ | 868/1621 [24:52<18:29, 1.47s/it]
54%|ββββββ | 869/1621 [24:53<18:18, 1.46s/it]
54%|ββββββ | 870/1621 [24:55<19:23, 1.55s/it]
54%|ββββββ | 870/1621 [24:55<19:23, 1.55s/it]
54%|ββββββ | 871/1621 [24:56<18:54, 1.51s/it]
54%|ββββββ | 872/1621 [24:58<18:38, 1.49s/it]
54%|ββββββ | 873/1621 [24:59<18:37, 1.49s/it]
54%|ββββββ |
|
|
0: {'loss': 0.2644, 'grad_norm': 0.30874390029265875, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.54} |
|
|
0: | 874/1621 [25:01<18:27, 1.48s/it]
54%|ββββββ | 875/1621 [25:02<18:10, 1.46s/it]
54%|ββββββ | 876/1621 [25:04<18:33, 1.49s/it]
54%|ββββββ | 877/1621 [25:05<18:22, 1.48s/it]
54%|ββββββ | 878/1621 [25:07<18:09, 1.47s/it]
54%|ββββββ | 879/1621 [25:08<18:07, 1.47s/it]
54%|ββββββ | 880/1621 [25:09<17:55, 1.45s/it]
54%|ββββββ | 880/1621 [25:09<17:55, 1.45s/it]
54%|ββββββ | 881/1621 [25:11<17:48, 1.44s/it]
54%|ββββββ | 882/1621 [25:12<18:13, 1.48s/it]
54%|ββββββ | 883/1621 [25:14<18:00, 1.46s/it]
55%|ββββββ | 884/1621 [25:15<17:47, 1.45s/it]
55%|ββββββ | 885/1621 [25:17<17:38, 1.44s/it]
55%|ββββββ | 886/1621 [25:18<17:44, 1.45s/it]
55%|ββββββ | 887/1621 [25:20<17:38, 1.44s/it]
55%|ββββββ | 888/1621 [25:21<18:40, 1 |
|
|
0: {'loss': 0.2605, 'grad_norm': 0.334940089088556, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.55} |
|
|
0: {'loss': 0.2684, 'grad_norm': 0.3085779318464317, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.56} |
|
|
0: .53s/it]
55%|ββββββ | 889/1621 [25:23<18:22, 1.51s/it]
55%|ββββββ | 890/1621 [25:24<18:20, 1.51s/it]
55%|ββββββ | 890/1621 [25:24<18:20, 1.51s/it]
55%|ββββββ | 891/1621 [25:26<18:14, 1.50s/it]
55%|ββββββ | 892/1621 [25:27<17:59, 1.48s/it]
55%|ββββββ | 893/1621 [25:29<17:53, 1.47s/it]
55%|ββββββ | 894/1621 [25:30<17:41, 1.46s/it]
55%|ββββββ | 895/1621 [25:32<17:28, 1.44s/it]
55%|ββββββ | 896/1621 [25:33<17:17, 1.43s/it]
55%|ββββββ | 897/1621 [25:34<17:17, 1.43s/it]
55%|ββββββ | 898/1621 [25:36<17:12, 1.43s/it]
55%|ββββββ | 899/1621 [25:37<17:08, 1.42s/it]
56%|ββββββ | 900/1621 [25:39<18:02, 1.50s/it]
56%|ββββββ | 900/1621 [25:39<18:02, 1.50s/it]
56%|ββββββ | |
|
|
0: {'loss': 0.2617, 'grad_norm': 0.3477064834521338, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.56} |
|
|
0: 901/1621 [25:40<17:53, 1.49s/it]
56%|ββββββ | 902/1621 [25:42<17:34, 1.47s/it]
56%|ββββββ | 903/1621 [25:43<17:22, 1.45s/it]
56%|ββββββ | 904/1621 [25:45<17:11, 1.44s/it]
56%|ββββββ | 905/1621 [25:46<17:11, 1.44s/it]
56%|ββββββ | 906/1621 [25:47<17:05, 1.43s/it]
56%|ββββββ | 907/1621 [25:49<17:01, 1.43s/it]
56%|ββββββ | 908/1621 [25:50<17:02, 1.43s/it]
56%|ββββββ | 909/1621 [25:52<16:57, 1.43s/it]
56%|ββββββ | 910/1621 [25:53<16:56, 1.43s/it]
56%|ββββββ | 910/1621 [25:53<16:56, 1.43s/it]
56%|ββββββ | 911/1621 [25:55<17:24, 1.47s/it]
56%|ββββββ | 912/1621 [25:56<17:12, 1.46s/it]
56%|ββββββ | 913/1621 [25:58<17:01, 1.44s/it]
56%|ββββββ | 914/1621 [25:59<16:54, 1.43s/it]
56%|ββββββ | 915/1621 [26:00<16:51, 1.4 |
|
|
0: {'loss': 0.2648, 'grad_norm': 0.30501036108316015, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.57} |
|
|
0: 3s/it]
57%|ββββββ | 916/1621 [26:02<16:43, 1.42s/it]
57%|ββββββ | 917/1621 [26:03<16:37, 1.42s/it]
57%|ββββββ | 918/1621 [26:05<16:36, 1.42s/it]
57%|ββββββ | 919/1621 [26:06<16:35, 1.42s/it]
57%|ββββββ | 920/1621 [26:07<16:36, 1.42s/it]
57%|ββββββ | 920/1621 [26:07<16:36, 1.42s/it]
57%|ββββββ | 921/1621 [26:09<16:35, 1.42s/it]
57%|ββββββ | 922/1621 [26:10<16:34, 1.42s/it]
57%|ββββββ | 923/1621 [26:12<16:43, 1.44s/it]
57%|ββββββ | 924/1621 [26:13<16:43, 1.44s/it]
57%|ββββββ | 925/1621 [26:15<17:29, 1.51s/it]
57%|ββββββ | 926/1621 [26:16<17:09, 1.48s/it]
57%|ββββββ | 927/1621 [26:18<17:02, 1.47s/it]
57%|ββββββ | 928/1621 [26:19<16:50, 1.46s/it]
57%|ββββββ | 929/1621 [26:21<16:44, 1.45s/it]
57%|βββββ |
|
|
0: {'loss': 0.2639, 'grad_norm': 0.29127250759681794, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.57} |
|
|
0: {'loss': 0.2578, 'grad_norm': 0.32113325015435573, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.58} |
|
|
0: β | 930/1621 [26:22<16:36, 1.44s/it]
57%|ββββββ | 930/1621 [26:22<16:36, 1.44s/it]
57%|ββββββ | 931/1621 [26:23<16:33, 1.44s/it]
57%|ββββββ | 932/1621 [26:25<16:27, 1.43s/it]
58%|ββββββ | 933/1621 [26:26<16:49, 1.47s/it]
58%|ββββββ | 934/1621 [26:28<16:55, 1.48s/it]
58%|ββββββ | 935/1621 [26:29<16:39, 1.46s/it]
58%|ββββββ | 936/1621 [26:31<16:48, 1.47s/it]
58%|ββββββ | 937/1621 [26:33<17:21, 1.52s/it]
58%|ββββββ | 938/1621 [26:34<17:11, 1.51s/it]
58%|ββββββ | 939/1621 [26:35<17:02, 1.50s/it]
58%|ββββββ | 940/1621 [26:37<16:43, 1.47s/it]
58%|ββββββ | 940/1621 [26:37<16:43, 1.47s/it]
58%|ββββββ | 941/1621 [26:39<17:13, 1.52s/it]
58%|ββββββ | 942/1621 [26:40<17:13, 1.52s |
|
|
0: {'loss': 0.2557, 'grad_norm': 0.29622986437781335, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.59} |
|
|
0: /it]
58%|ββββββ | 943/1621 [26:41<16:51, 1.49s/it]
58%|ββββββ | 944/1621 [26:43<16:36, 1.47s/it]
58%|ββββββ | 945/1621 [26:44<16:31, 1.47s/it]
58%|ββββββ | 946/1621 [26:46<16:23, 1.46s/it]
58%|ββββββ | 947/1621 [26:47<16:17, 1.45s/it]
58%|ββββββ | 948/1621 [26:49<16:09, 1.44s/it]
59%|ββββββ | 949/1621 [26:50<16:20, 1.46s/it]
59%|ββββββ | 950/1621 [26:52<16:26, 1.47s/it]
59%|ββββββ | 950/1621 [26:52<16:26, 1.47s/it]
59%|ββββββ | 951/1621 [26:53<16:15, 1.46s/it]
59%|ββββββ | 952/1621 [26:54<16:06, 1.45s/it]
59%|ββββββ | 953/1621 [26:56<16:42, 1.50s/it]
59%|ββββββ | 954/1621 [26:58<16:25, 1.48s/it]
59%|ββββββ | 955/1621 [26:59<16:13, 1.46s/it]
59%|ββββββ | 956/1621 [27:00<16:10, 1.46s/it]
59%|βββββοΏ½ |
|
|
0: {'loss': 0.2527, 'grad_norm': 0.3334298807383501, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.59} |
|
|
0: οΏ½ | 957/1621 [27:02<16:37, 1.50s/it]
59%|ββββββ | 958/1621 [27:04<16:36, 1.50s/it]
59%|ββββββ | 959/1621 [27:05<17:28, 1.58s/it]
59%|ββββββ | 960/1621 [27:07<17:25, 1.58s/it]
59%|ββββββ | 960/1621 [27:07<17:25, 1.58s/it]
59%|ββββββ | 961/1621 [27:08<16:47, 1.53s/it]
59%|ββββββ | 962/1621 [27:10<16:25, 1.50s/it]
59%|ββββββ | 963/1621 [27:11<16:13, 1.48s/it]
59%|ββββββ | 964/1621 [27:13<16:13, 1.48s/it]
60%|ββββββ | 965/1621 [27:14<16:01, 1.47s/it]
60%|ββββββ | 966/1621 [27:15<15:51, 1.45s/it]
60%|ββββββ | 967/1621 [27:17<15:45, 1.45s/it]
60%|ββββββ | 968/1621 [27:18<15:38, 1.44s/it]
60%|ββββββ | 969/1621 [27:20<15:41, 1.44s/it]
60%|ββββββ | 970/1621 [27:21<15:37, 1.44s/it]
|
|
|
0: {'loss': 0.2671, 'grad_norm': 0.3152370154555651, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.6} |
|
|
0: [2025-09-02 19:15:19,083] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-975 |
|
|
0: [2025-09-02 19:15:23,850] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None` |
|
|
0: {'loss': 0.2586, 'grad_norm': 0.2761551896866444, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.6} |
|
|
0:
60%|ββββββ | 970/1621 [27:21<15:37, 1.44s/it]
60%|ββββββ | 971/1621 [27:23<15:29, 1.43s/it]
60%|ββββββ | 972/1621 [27:24<15:25, 1.43s/it]
60%|ββββββ | 973/1621 [27:25<15:22, 1.42s/it]
60%|ββββββ | 974/1621 [27:27<15:26, 1.43s/it]
60%|ββββββ | 975/1621 [27:28<15:25, 1.43s/it]
60%|ββββββ | 976/1621 [27:39<45:02, 4.19s/it]
60%|ββββββ | 977/1621 [27:40<36:09, 3.37s/it]
60%|ββββββ | 978/1621 [27:42<29:54, 2.79s/it]
60%|ββββββ | 979/1621 [27:43<25:37, 2.39s/it]
60%|ββββββ | 980/1621 [27:45<22:35, 2.11s/it]
60%|ββββββ | 980/1621 [27:45<22:35, 2.11s/it]
61%|ββββββ | 981/1621 [27:46<20:20, 1.91s/it]
61%|ββββββ | 982/1621 [27:48<18:46, 1.76s/it]
61%|ββββββ | 983/1621 [27:49<17:43, 1.67s/it]
61%|ββββββ |
|
|
0: {'loss': 0.2666, 'grad_norm': 0.3193029520192468, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.61} |
|
|
0: | 984/1621 [27:51<17:39, 1.66s/it]
61%|ββββββ | 985/1621 [27:52<16:48, 1.59s/it]
61%|ββββββ | 986/1621 [27:54<16:12, 1.53s/it]
61%|ββββββ | 987/1621 [27:55<15:53, 1.50s/it]
61%|ββββββ | 988/1621 [27:56<15:32, 1.47s/it]
61%|ββββββ | 989/1621 [27:58<16:08, 1.53s/it]
61%|ββββββ | 990/1621 [28:00<16:21, 1.56s/it]
61%|ββββββ | 990/1621 [28:00<16:21, 1.56s/it]
61%|ββββββ | 991/1621 [28:01<15:53, 1.51s/it]
61%|ββββββ | 992/1621 [28:03<15:45, 1.50s/it]
61%|βββββββ | 993/1621 [28:04<15:27, 1.48s/it]
61%|βββββββ | 994/1621 [28:05<15:21, 1.47s/it]
61%|βββββββ | 995/1621 [28:07<15:09, 1.45s/it]
61%|βββββββ | 996/1621 [28:08<15:00, 1.44s/it]
62%|βββββββ | 997/1621 [28:10<15:00, 1.44s/it]
 62%|βββββββ | 998/1621 [28:11<15:41, 1.51s/it] |
|
|
0: {'loss': 0.2576, 'grad_norm': 0.32033329683867157, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.62} |
|
|
0: {'loss': 0.2554, 'grad_norm': 0.324086242099498, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.62} |
|
|
0:
62%|βββββββ | 999/1621 [28:13<15:23, 1.48s/it]
62%|βββββββ | 1000/1621 [28:14<15:08, 1.46s/it]
62%|βββββββ | 1000/1621 [28:14<15:08, 1.46s/it]
62%|βββββββ | 1001/1621 [28:16<14:59, 1.45s/it]
62%|βββββββ | 1002/1621 [28:17<14:49, 1.44s/it]
62%|βββββββ | 1003/1621 [28:18<14:42, 1.43s/it]
62%|βββββββ | 1004/1621 [28:20<15:49, 1.54s/it]
62%|βββββββ | 1005/1621 [28:22<15:51, 1.55s/it]
62%|βββββββ | 1006/1621 [28:23<15:26, 1.51s/it]
62%|βββββββ | 1007/1621 [28:25<15:20, 1.50s/it]
62%|βββββββ | 1008/1621 [28:26<15:08, 1.48s/it]
62%|βββββββ | 1009/1621 [28:28<14:58, 1.47s/it]
62%|βββββββ | 1010/1621 [28:29<14:46, 1.45s/it]
 62%|βββββββ | 1010/1621 [28:29<14:46, 1.45s/it] |
|
|
0: {'loss': 0.2586, 'grad_norm': 0.2985933975674141, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.63} |
|
|
0:
62%|βββββββ | 1011/1621 [28:30<14:37, 1.44s/it]
62%|βββββββ | 1012/1621 [28:32<14:59, 1.48s/it]
62%|βββββββ | 1013/1621 [28:33<14:49, 1.46s/it]
63%|βββββββ | 1014/1621 [28:35<14:49, 1.46s/it]
63%|βββββββ | 1015/1621 [28:36<14:39, 1.45s/it]
63%|βββββββ | 1016/1621 [28:38<14:32, 1.44s/it]
63%|βββββββ | 1017/1621 [28:39<14:25, 1.43s/it]
63%|βββββββ | 1018/1621 [28:41<14:34, 1.45s/it]
63%|βββββββ | 1019/1621 [28:42<14:56, 1.49s/it]
63%|βββββββ | 1020/1621 [28:44<14:43, 1.47s/it]
63%|βββββββ | 1020/1621 [28:44<14:43, 1.47s/it]
63%|βββββββ | 1021/1621 [28:45<14:32, 1.45s/it]
63%|βββββββ | 1022/1621 [28:46<14:31, 1.46s/it]
63%|βββββββ | 1023/1621 [28:48<14:39, 1.47s/it]
 63%|βββββββ | 1024/1621 [28:49<14:33, 1.46s/it] |
|
|
0: {'loss': 0.2678, 'grad_norm': 0.3133514681538571, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.64} |
|
|
0:
63%|βββββββ | 1025/1621 [28:51<14:32, 1.46s/it]
63%|βββββββ | 1026/1621 [28:52<14:24, 1.45s/it]
63%|βββββββ | 1027/1621 [28:54<14:16, 1.44s/it]
63%|βββββββ | 1028/1621 [28:55<14:16, 1.45s/it]
63%|βββββββ | 1029/1621 [28:57<14:13, 1.44s/it]
64%|βββββββ | 1030/1621 [28:58<14:04, 1.43s/it]
64%|βββββββ | 1030/1621 [28:58<14:04, 1.43s/it]
64%|βββββββ | 1031/1621 [28:59<14:10, 1.44s/it]
64%|βββββββ | 1032/1621 [29:01<14:05, 1.44s/it]
64%|βββββββ | 1033/1621 [29:02<14:04, 1.44s/it]
64%|βββββββ | 1034/1621 [29:04<14:03, 1.44s/it]
64%|βββββββ | 1035/1621 [29:05<14:03, 1.44s/it]
64%|βββββββ | 1036/1621 [29:07<14:11, 1.46s/it]
64%|βββββββ | 1037/1621 [29:08<14:32, 1.49s/it] |
|
|
0: {'loss': 0.259, 'grad_norm': 0.29610628515170656, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.64} |
|
|
0:
64%|βββββββ | 1038/1621 [29:10<14:23, 1.48s/it]
64%|βββββββ | 1039/1621 [29:11<14:11, 1.46s/it]
64%|βββββββ | 1040/1621 [29:13<14:00, 1.45s/it]
64%|βββββββ | 1040/1621 [29:13<14:00, 1.45s/it]
64%|βββββββ | 1041/1621 [29:14<13:53, 1.44s/it]
64%|βββββββ | 1042/1621 [29:16<14:05, 1.46s/it]
64%|βββββββ | 1043/1621 [29:17<13:57, 1.45s/it]
64%|βββββββ | 1044/1621 [29:18<14:04, 1.46s/it]
64%|βββββββ | 1045/1621 [29:20<14:40, 1.53s/it]
65%|βββββββ | 1046/1621 [29:22<14:23, 1.50s/it]
65%|βββββββ | 1047/1621 [29:23<14:32, 1.52s/it]
65%|βββββββ | 1048/1621 [29:25<14:13, 1.49s/it]
65%|βββββββ | 1049/1621 [29:26<14:21, 1.51s/it]
65%|βββββββ | 1050/1621 [29:28<15:13, 1.60s/it]
|
|
|
0: {'loss': 0.2581, 'grad_norm': 0.36909157148741467, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.65} |
|
|
0: {'loss': 0.2554, 'grad_norm': 0.32556652711144385, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.65} |
|
|
0:
65%|βββββββ | 1050/1621 [29:28<15:13, 1.60s/it]
65%|βββββββ | 1051/1621 [29:30<15:45, 1.66s/it]
65%|βββββββ | 1052/1621 [29:31<15:01, 1.59s/it]
65%|βββββββ | 1053/1621 [29:33<15:04, 1.59s/it]
65%|βββββββ | 1054/1621 [29:34<14:34, 1.54s/it]
65%|βββββββ | 1055/1621 [29:36<14:18, 1.52s/it]
65%|βββββββ | 1056/1621 [29:37<14:49, 1.57s/it]
65%|βββββββ | 1057/1621 [29:39<14:41, 1.56s/it]
65%|βββββββ | 1058/1621 [29:40<14:14, 1.52s/it]
65%|βββββββ | 1059/1621 [29:42<13:53, 1.48s/it]
65%|βββββββ | 1060/1621 [29:43<13:45, 1.47s/it]
65%|βββββββ | 1060/1621 [29:43<13:45, 1.47s/it]
65%|βββββββ | 1061/1621 [29:45<13:37, 1.46s/it]
66%|βββββββ | 1062/1621 [29:46<13:38, 1.46s/it]
 66%|βββββββ | 1063/1621 [29:48<13:44, 1.48s/it] |
|
|
0: {'loss': 0.2549, 'grad_norm': 0.32018891136312116, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.66} |
|
|
0:
66%|βββββββ | 1064/1621 [29:49<14:10, 1.53s/it]
66%|βββββββ | 1065/1621 [29:51<13:50, 1.49s/it]
66%|βββββββ | 1066/1621 [29:52<14:10, 1.53s/it]
66%|βββββββ | 1067/1621 [29:54<13:48, 1.50s/it]
66%|βββββββ | 1068/1621 [29:55<13:34, 1.47s/it]
66%|βββββββ | 1069/1621 [29:57<13:34, 1.47s/it]
66%|βββββββ | 1070/1621 [29:58<13:23, 1.46s/it]
66%|βββββββ | 1070/1621 [29:58<13:23, 1.46s/it]
66%|βββββββ | 1071/1621 [29:59<13:20, 1.46s/it]
66%|βββββββ | 1072/1621 [30:01<13:12, 1.44s/it]
66%|βββββββ | 1073/1621 [30:02<13:07, 1.44s/it]
66%|βββββββ | 1074/1621 [30:04<13:01, 1.43s/it]
66%|βββββββ | 1075/1621 [30:05<13:48, 1.52s/it]
66%|βββββββ | 1076/1621 [30:07<13:40, 1.51s/it]
 66%|βββββββ | 1077/1621 [30:08<13:31, 1.49s/it] |
|
|
0: {'loss': 0.2636, 'grad_norm': 0.3165670706179018, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.67} |
|
|
0:
67%|βββββββ | 1078/1621 [30:10<13:19, 1.47s/it]
67%|βββββββ | 1079/1621 [30:11<13:10, 1.46s/it]
67%|βββββββ | 1080/1621 [30:13<13:01, 1.44s/it]
67%|βββββββ | 1080/1621 [30:13<13:01, 1.44s/it]
67%|βββββββ | 1081/1621 [30:14<12:55, 1.44s/it]
67%|βββββββ | 1082/1621 [30:15<12:49, 1.43s/it]
67%|βββββββ | 1083/1621 [30:17<13:35, 1.52s/it]
67%|βββββββ | 1084/1621 [30:19<13:55, 1.56s/it]
67%|βββββββ | 1085/1621 [30:20<13:34, 1.52s/it]
67%|βββββββ | 1086/1621 [30:22<13:16, 1.49s/it]
67%|βββββββ | 1087/1621 [30:23<13:42, 1.54s/it]
67%|βββββββ | 1088/1621 [30:25<13:28, 1.52s/it]
67%|βββββββ | 1089/1621 [30:26<13:11, 1.49s/it]
67%|βββββββ | 1090/1621 [30:28<13:34, 1.53s/it]
|
|
|
0: {'loss': 0.2623, 'grad_norm': 0.32402128653847556, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.67} |
|
|
0: {'loss': 0.2547, 'grad_norm': 0.3385641999618941, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.68} |
|
|
0:
67%|βββββββ | 1090/1621 [30:28<13:34, 1.53s/it]
67%|βββββββ | 1091/1621 [30:29<13:20, 1.51s/it]
67%|βββββββ | 1092/1621 [30:31<13:19, 1.51s/it]
67%|βββββββ | 1093/1621 [30:32<13:03, 1.48s/it]
67%|βββββββ | 1094/1621 [30:34<12:51, 1.46s/it]
68%|βββββββ | 1095/1621 [30:35<12:43, 1.45s/it]
68%|βββββββ | 1096/1621 [30:37<13:12, 1.51s/it]
68%|βββββββ | 1097/1621 [30:38<13:12, 1.51s/it]
68%|βββββββ | 1098/1621 [30:40<12:59, 1.49s/it]
68%|βββββββ | 1099/1621 [30:41<13:02, 1.50s/it]
68%|βββββββ | 1100/1621 [30:43<12:51, 1.48s/it]
68%|βββββββ | 1100/1621 [30:43<12:51, 1.48s/it]
68%|βββββββ | 1101/1621 [30:44<13:10, 1.52s/it]
 68%|βββββββ | 1102/1621 [30:46<12:59, 1.50s/it] |
|
|
0: {'loss': 0.2559, 'grad_norm': 0.30780759695128085, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.68} |
|
|
0:
68%|βββββββ | 1103/1621 [30:47<12:48, 1.48s/it]
68%|βββββββ | 1104/1621 [30:49<12:43, 1.48s/it]
68%|βββββββ | 1105/1621 [30:50<12:32, 1.46s/it]
68%|βββββββ | 1106/1621 [30:51<12:28, 1.45s/it]
68%|βββββββ | 1107/1621 [30:53<12:26, 1.45s/it]
68%|βββββββ | 1108/1621 [30:54<12:29, 1.46s/it]
68%|βββββββ | 1109/1621 [30:56<12:28, 1.46s/it]
68%|βββββββ | 1110/1621 [30:57<12:19, 1.45s/it]
68%|βββββββ | 1110/1621 [30:57<12:19, 1.45s/it]
69%|βββββββ | 1111/1621 [30:59<12:26, 1.46s/it]
69%|βββββββ | 1112/1621 [31:00<12:24, 1.46s/it]
69%|βββββββ | 1113/1621 [31:02<12:36, 1.49s/it]
69%|βββββββ | 1114/1621 [31:03<12:34, 1.49s/it]
69%|βββββββ | 1115/1621 [31:05<13:06, 1.55s/it]
 69%|βββββββ | 1116/1621 [31:06<12:51, 1.53s/it] |
|
|
0: {'loss': 0.256, 'grad_norm': 0.3035060051707476, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.69} |
|
|
0:
69%|βββββββ | 1117/1621 [31:08<12:41, 1.51s/it]
69%|βββββββ | 1118/1621 [31:09<12:27, 1.49s/it]
69%|βββββββ | 1119/1621 [31:11<12:19, 1.47s/it]
69%|βββββββ | 1120/1621 [31:12<12:11, 1.46s/it]
69%|βββββββ | 1120/1621 [31:12<12:11, 1.46s/it]
69%|βββββββ | 1121/1621 [31:14<12:18, 1.48s/it]
69%|βββββββ | 1122/1621 [31:15<12:12, 1.47s/it]
69%|βββββββ | 1123/1621 [31:17<12:08, 1.46s/it]
69%|βββββββ | 1124/1621 [31:18<12:06, 1.46s/it]
69%|βββββββ | 1125/1621 [31:20<12:06, 1.46s/it]
69%|βββββββ | 1126/1621 [31:21<11:59, 1.45s/it]
70%|βββββββ | 1127/1621 [31:22<11:57, 1.45s/it]
70%|βββββββ | 1128/1621 [31:24<11:59, 1.46s/it]
70%|βββββββ | 1129/1621 [31:25<12:04, 1.47s/it]
 70%|βββββββ | 1130/1621 [31:27<11:56, 1.46s/it] |
|
|
0: {'loss': 0.2548, 'grad_norm': 0.30070942936522693, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.7} |
|
|
0: {'loss': 0.2575, 'grad_norm': 0.31196101931665887, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.7} |
|
|
0:
70%|βββββββ | 1130/1621 [31:27<11:56, 1.46s/it]
70%|βββββββ | 1131/1621 [31:28<11:48, 1.45s/it]
70%|βββββββ | 1132/1621 [31:30<11:46, 1.44s/it]
70%|βββββββ | 1133/1621 [31:31<12:06, 1.49s/it]
70%|βββββββ | 1134/1621 [31:33<11:55, 1.47s/it]
70%|βββββββ | 1135/1621 [31:34<11:49, 1.46s/it]
70%|βββββββ | 1136/1621 [31:36<11:41, 1.45s/it]
70%|βββββββ | 1137/1621 [31:37<12:35, 1.56s/it]
70%|βββββββ | 1138/1621 [31:39<12:14, 1.52s/it]
70%|βββββββ | 1139/1621 [31:40<12:05, 1.51s/it]
70%|βββββββ | 1140/1621 [31:42<11:53, 1.48s/it]
70%|βββββββ | 1140/1621 [31:42<11:53, 1.48s/it]
70%|βββββββ | 1141/1621 [31:43<11:49, 1.48s/it]
 70%|βββββββ | 1142/1621 [31:45<12:06, 1.52s/it] |
|
|
0: {'loss': 0.2523, 'grad_norm': 0.3316408275634874, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.71} |
|
|
0:
71%|βββββββ | 1143/1621 [31:46<12:37, 1.59s/it]
71%|βββββββ | 1144/1621 [31:48<12:16, 1.54s/it]
71%|βββββββ | 1145/1621 [31:49<11:56, 1.51s/it]
71%|βββββββ | 1146/1621 [31:51<11:46, 1.49s/it]
71%|βββββββ | 1147/1621 [31:52<11:41, 1.48s/it]
71%|βββββββ | 1148/1621 [31:54<12:11, 1.55s/it]
71%|βββββββ | 1149/1621 [31:55<11:52, 1.51s/it]
71%|βββββββ | 1150/1621 [31:57<12:05, 1.54s/it]
71%|βββββββ | 1150/1621 [31:57<12:05, 1.54s/it]
71%|βββββββ | 1151/1621 [31:59<12:14, 1.56s/it]
71%|βββββββ | 1152/1621 [32:00<12:02, 1.54s/it]
71%|βββββββ | 1153/1621 [32:02<11:55, 1.53s/it]
71%|βββββββ | 1154/1621 [32:03<12:03, 1.55s/it]
 71%|ββββββββ | 1155/1621 [32:05<11:42, 1.51s/it] |
|
|
0: {'loss': 0.2534, 'grad_norm': 0.31318013713966153, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.72} |
|
|
0:
71%|ββββββββ | 1156/1621 [32:06<11:27, 1.48s/it]
71%|ββββββββ | 1157/1621 [32:07<11:24, 1.48s/it]
71%|ββββββββ | 1158/1621 [32:09<11:15, 1.46s/it]
71%|ββββββββ | 1159/1621 [32:10<11:12, 1.46s/it]
72%|ββββββββ | 1160/1621 [32:12<11:08, 1.45s/it]
72%|ββββββββ | 1160/1621 [32:12<11:08, 1.45s/it]
72%|ββββββββ | 1161/1621 [32:13<11:02, 1.44s/it]
72%|ββββββββ | 1162/1621 [32:15<11:02, 1.44s/it]
72%|ββββββββ | 1163/1621 [32:16<10:56, 1.43s/it]
72%|ββββββββ | 1164/1621 [32:18<10:55, 1.43s/it]
72%|ββββββββ | 1165/1621 [32:19<10:55, 1.44s/it]
72%|ββββββββ | 1166/1621 [32:20<10:51, 1.43s/it]
72%|ββββββββ | 1167/1621 [32:22<10:47, 1.43s/it]
72%|ββββββββ | 1168/1621 [32:23<10:48, 1.43s/it]
 72%|ββββββββ | 1169/1621 [32:25<10:48, 1.43s/it] |
|
|
0: {'loss': 0.2614, 'grad_norm': 0.3167000616112638, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.72} |
|
|
0: {'loss': 0.2543, 'grad_norm': 0.314470839342128, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.73} |
|
|
0:
72%|ββββββββ | 1170/1621 [32:26<11:20, 1.51s/it]
72%|ββββββββ | 1170/1621 [32:26<11:20, 1.51s/it]
72%|ββββββββ | 1171/1621 [32:28<11:14, 1.50s/it]
72%|ββββββββ | 1172/1621 [32:29<11:02, 1.48s/it]
72%|ββββββββ | 1173/1621 [32:31<10:51, 1.46s/it]
72%|ββββββββ | 1174/1621 [32:32<10:48, 1.45s/it]
72%|ββββββββ | 1175/1621 [32:34<10:55, 1.47s/it]
73%|ββββββββ | 1176/1621 [32:35<10:54, 1.47s/it]
73%|ββββββββ | 1177/1621 [32:37<10:48, 1.46s/it]
73%|ββββββββ | 1178/1621 [32:38<10:45, 1.46s/it]
73%|ββββββββ | 1179/1621 [32:40<11:16, 1.53s/it]
73%|ββββββββ | 1180/1621 [32:41<11:05, 1.51s/it]
 73%|ββββββββ | 1180/1621 [32:41<11:05, 1.51s/it] |
|
|
0: {'loss': 0.2612, 'grad_norm': 0.2997645116358892, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.73} |
|
|
0:
73%|ββββββββ | 1181/1621 [32:43<10:51, 1.48s/it]
73%|ββββββββ | 1182/1621 [32:44<10:42, 1.46s/it]
73%|ββββββββ | 1183/1621 [32:45<10:35, 1.45s/it]
73%|ββββββββ | 1184/1621 [32:47<10:34, 1.45s/it]
73%|ββββββββ | 1185/1621 [32:48<10:30, 1.45s/it]
73%|ββββββββ | 1186/1621 [32:50<10:27, 1.44s/it]
73%|ββββββββ | 1187/1621 [32:51<10:22, 1.43s/it]
73%|ββββββββ | 1188/1621 [32:53<10:24, 1.44s/it]
73%|ββββββββ | 1189/1621 [32:54<10:20, 1.44s/it]
73%|ββββββββ | 1190/1621 [32:55<10:15, 1.43s/it]
73%|ββββββββ | 1190/1621 [32:55<10:15, 1.43s/it]
73%|ββββββββ | 1191/1621 [32:57<10:13, 1.43s/it]
74%|ββββββββ | 1192/1621 [32:58<10:37, 1.49s/it]
74%|ββββββββ | 1193/1621 [33:00<10:28, 1.47s/it]
|
|
|
0: {'loss': 0.2579, 'grad_norm': 0.3118112550629538, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.74} |
|
|
0: 74%|ββββββββ | 1194/1621 [33:01<10:21, 1.46s/it]
74%|ββββββββ | 1195/1621 [33:03<10:15, 1.44s/it]
74%|ββββββββ | 1196/1621 [33:04<10:12, 1.44s/it]
74%|ββββββββ | 1197/1621 [33:06<10:06, 1.43s/it]
74%|ββββββββ | 1198/1621 [33:07<10:06, 1.43s/it]
74%|ββββββββ | 1199/1621 [33:08<10:04, 1.43s/it]
74%|ββββββββ | 1200/1621 [33:10<10:00, 1.43s/it]
74%|ββββββββ | 1200/1621 [33:10<10:00, 1.43s/it]
74%|ββββββββ | 1201/1621 [33:11<10:01, 1.43s/it]
74%|ββββββββ | 1202/1621 [33:13<09:57, 1.42s/it]
74%|ββββββββ | 1203/1621 [33:14<10:04, 1.45s/it]
74%|ββββββββ | 1204/1621 [33:16<10:05, 1.45s/it]
74%|ββββββββ | 1205/1621 [33:17<10:05, 1.46s/it]
74%|ββββββββ | 1206/1621 [33:19<09:58, 1.44s/it]
 74%|ββββββββ | 1207/1621 [33:20<09:53, 1.43s/it] |
|
|
0: {'loss': 0.2595, 'grad_norm': 0.3115748471949704, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.75} |
|
|
0:
75%|ββββββββ | 1208/1621 [33:21<09:50, 1.43s/it]
75%|ββββββββ | 1209/1621 [33:23<09:48, 1.43s/it]
75%|ββββββββ | 1210/1621 [33:24<09:43, 1.42s/it]
75%|ββββββββ | 1210/1621 [33:24<09:43, 1.42s/it]
75%|ββββββββ | 1211/1621 [33:26<09:43, 1.42s/it]
75%|ββββββββ | 1212/1621 [33:27<09:42, 1.42s/it]
75%|ββββββββ | 1213/1621 [33:29<09:41, 1.43s/it]
75%|ββββββββ | 1214/1621 [33:30<09:38, 1.42s/it]
75%|ββββββββ | 1215/1621 [33:31<09:47, 1.45s/it]
75%|ββββββββ | 1216/1621 [33:33<09:43, 1.44s/it]
75%|ββββββββ | 1217/1621 [33:34<09:38, 1.43s/it]
75%|ββββββββ | 1218/1621 [33:36<09:39, 1.44s/it]
75%|ββββββββ | 1219/1621 [33:37<09:35, 1.43s/it]
 75%|ββββββββ | 1220/1621 [33:39<09:34, 1.43s/it] |
|
|
0: {'loss': 0.2622, 'grad_norm': 0.3260517255087193, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.75} |
|
|
0: {'loss': 0.2566, 'grad_norm': 0.29314960466358514, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.76} |
|
|
0:
75%|ββββββββ | 1220/1621 [33:39<09:34, 1.43s/it]
75%|ββββββββ | 1221/1621 [33:40<09:37, 1.44s/it]
75%|ββββββββ | 1222/1621 [33:41<09:32, 1.43s/it]
75%|ββββββββ | 1223/1621 [33:43<09:30, 1.43s/it]
76%|ββββββββ | 1224/1621 [33:44<09:37, 1.46s/it]
76%|ββββββββ | 1225/1621 [33:46<09:30, 1.44s/it]
76%|ββββββββ | 1226/1621 [33:47<09:26, 1.43s/it]
76%|ββββββββ | 1227/1621 [33:49<09:29, 1.45s/it]
76%|ββββββββ | 1228/1621 [33:50<09:24, 1.44s/it]
76%|ββββββββ | 1229/1621 [33:52<09:18, 1.43s/it]
76%|ββββββββ | 1230/1621 [33:53<09:18, 1.43s/it]
76%|ββββββββ | 1230/1621 [33:53<09:18, 1.43s/it]
76%|ββββββββ | 1231/1621 [33:54<09:20, 1.44s/it]
 76%|ββββββββ | 1232/1621 [33:56<09:17, 1.43s/it] |
|
|
0: {'loss': 0.2569, 'grad_norm': 0.3515763746212918, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.76} |
|
|
0:
76%|ββββββββ | 1233/1621 [33:57<09:21, 1.45s/it]
76%|ββββββββ | 1234/1621 [33:59<09:20, 1.45s/it]
76%|ββββββββ | 1235/1621 [34:00<09:14, 1.44s/it]
76%|ββββββββ | 1236/1621 [34:02<09:10, 1.43s/it]
76%|ββββββββ | 1237/1621 [34:03<09:27, 1.48s/it]
76%|ββββββββ | 1238/1621 [34:05<09:22, 1.47s/it]
76%|ββββββββ | 1239/1621 [34:06<09:17, 1.46s/it]
76%|ββββββββ | 1240/1621 [34:07<09:11, 1.45s/it]
76%|ββββββββ | 1240/1621 [34:07<09:11, 1.45s/it]
77%|ββββββββ | 1241/1621 [34:09<09:08, 1.44s/it]
77%|ββββββββ | 1242/1621 [34:10<09:05, 1.44s/it]
77%|ββββββββ | 1243/1621 [34:12<09:03, 1.44s/it]
77%|ββββββββ | 1244/1621 [34:13<09:07, 1.45s/it]
 77%|ββββββββ | 1245/1621 [34:15<09:02, 1.44s/it] |
|
|
0: {'loss': 0.2568, 'grad_norm': 0.318913671719499, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.77} |
|
|
0:
77%|ββββββββ | 1246/1621 [34:16<09:00, 1.44s/it]
77%|ββββββββ | 1247/1621 [34:18<08:57, 1.44s/it]
77%|ββββββββ | 1248/1621 [34:19<08:55, 1.44s/it]
77%|ββββββββ | 1249/1621 [34:21<09:18, 1.50s/it]
77%|ββββββββ | 1250/1621 [34:22<09:23, 1.52s/it]
77%|ββββββββ | 1250/1621 [34:22<09:23, 1.52s/it]
77%|ββββββββ | 1251/1621 [34:24<09:28, 1.54s/it]
77%|ββββββββ | 1252/1621 [34:25<09:19, 1.52s/it]
77%|ββββββββ | 1253/1621 [34:27<09:18, 1.52s/it]
77%|ββββββββ | 1254/1621 [34:28<09:18, 1.52s/it]
77%|ββββββββ | 1255/1621 [34:30<09:34, 1.57s/it]
77%|ββββββββ | 1256/1621 [34:31<09:15, 1.52s/it]
78%|ββββββββ | 1257/1621 [34:33<09:02, 1.49s/it]
 78%|ββββββββ | 1258/1621 [34:34<08:57, 1.48s/it] |
|
|
0: {'loss': 0.2585, 'grad_norm': 0.3250021337489215, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.78} |
|
|
0: {'loss': 0.2579, 'grad_norm': 0.31493837801792507, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.78} |
|
|
0:
78%|ββββββββ | 1259/1621 [34:36<08:49, 1.46s/it]
78%|ββββββββ | 1260/1621 [34:37<08:46, 1.46s/it]
78%|ββββββββ | 1260/1621 [34:37<08:46, 1.46s/it]
78%|ββββββββ | 1261/1621 [34:39<08:44, 1.46s/it]
78%|ββββββββ | 1262/1621 [34:40<09:09, 1.53s/it]
78%|ββββββββ | 1263/1621 [34:42<08:57, 1.50s/it]
78%|ββββββββ | 1264/1621 [34:43<08:50, 1.48s/it]
78%|ββββββββ | 1265/1621 [34:45<08:45, 1.48s/it]
78%|ββββββββ | 1266/1621 [34:46<08:37, 1.46s/it]
78%|ββββββββ | 1267/1621 [34:47<08:30, 1.44s/it]
78%|ββββββββ | 1268/1621 [34:49<08:48, 1.50s/it]
78%|ββββββββ | 1269/1621 [34:50<08:37, 1.47s/it]
78%|ββββββββ | 1270/1621 [34:52<08:29, 1.45s/it]
 78%|ββββββββ | 1270/1621 [34:52<08:29, 1.45s/it] |
|
|
0: {'loss': 0.252, 'grad_norm': 0.30830369578361677, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.79} |
|
|
0:
78%|ββββββββ | 1271/1621 [34:53<08:42, 1.49s/it]
78%|ββββββββ | 1272/1621 [34:55<08:33, 1.47s/it]
79%|ββββββββ | 1273/1621 [34:56<08:28, 1.46s/it]
79%|ββββββββ | 1274/1621 [34:58<08:40, 1.50s/it]
79%|ββββββββ | 1275/1621 [34:59<08:33, 1.48s/it]
79%|ββββββββ | 1276/1621 [35:01<08:26, 1.47s/it]
79%|ββββββββ | 1277/1621 [35:02<08:35, 1.50s/it]
79%|ββββββββ | 1278/1621 [35:04<08:26, 1.48s/it]
79%|ββββββββ | 1279/1621 [35:05<08:23, 1.47s/it]
79%|ββββββββ | 1280/1621 [35:07<08:24, 1.48s/it]
79%|ββββββββ | 1280/1621 [35:07<08:24, 1.48s/it]
79%|ββββββββ | 1281/1621 [35:08<08:18, 1.47s/it]
79%|ββββββββ | 1282/1621 [35:10<08:12, 1.45s/it]
 79%|ββββββββ | 1283/1621 [35:11<08:07, 1.44s/it] |
|
|
0: {'loss': 0.2576, 'grad_norm': 0.3198839807651436, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.8} |
|
|
0:
79%|ββββββββ | 1284/1621 [35:13<08:10, 1.46s/it]
79%|ββββββββ | 1285/1621 [35:14<08:06, 1.45s/it]
79%|ββββββββ | 1286/1621 [35:15<08:08, 1.46s/it]
79%|ββββββββ | 1287/1621 [35:17<08:23, 1.51s/it]
79%|ββββββββ | 1288/1621 [35:18<08:13, 1.48s/it]
80%|ββββββββ | 1289/1621 [35:20<08:10, 1.48s/it]
80%|ββββββββ | 1290/1621 [35:21<08:06, 1.47s/it]
80%|ββββββββ | 1290/1621 [35:21<08:06, 1.47s/it]
80%|ββββββββ | 1291/1621 [35:23<08:00, 1.46s/it]
80%|ββββββββ | 1292/1621 [35:24<07:56, 1.45s/it]
80%|ββββββββ | 1293/1621 [35:26<07:56, 1.45s/it]
80%|ββββββββ | 1294/1621 [35:27<07:51, 1.44s/it]
80%|ββββββββ | 1295/1621 [35:29<07:52, 1.45s/it]
 80%|ββββββββ | 1296/1621 [35:30<07:53, 1.46s/it] |
|
|
0: {'loss': 0.2577, 'grad_norm': 0.3200484579932923, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.8} |
|
|
0: [2025-09-02 19:23:26,549] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-1300 |
|
|
0: [2025-09-02 19:23:31,543] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None` |
|
|
0:
80%|ββββββββ | 1297/1621 [35:32<07:52, 1.46s/it]
80%|ββββββββ | 1298/1621 [35:33<07:49, 1.45s/it]
80%|ββββββββ | 1299/1621 [35:34<07:43, 1.44s/it]
80%|ββββββββ | 1300/1621 [35:36<07:42, 1.44s/it]
80%|ββββββββ | 1300/1621 [35:36<07:42, 1.44s/it]
80%|ββββββββ | 1301/1621 [35:47<22:41, 4.25s/it]
80%|ββββββββ | 1302/1621 [35:48<18:06, 3.41s/it]
80%|ββββββββ | 1303/1621 [35:49<14:53, 2.81s/it]
80%|ββββββββ | 1304/1621 [35:51<12:58, 2.46s/it]
81%|ββββββββ | 1305/1621 [35:53<11:20, 2.15s/it]
81%|ββββββββ | 1306/1621 [35:54<10:11, 1.94s/it]
81%|ββββββββ | 1307/1621 [35:55<09:20, 1.79s/it]
81%|ββββββββ | 1308/1621 [35:57<08:43, 1.67s/it]
 81%|ββββββββ | 1309/1621 [35:58<08:21, 1.61s/it] |
|
|
0: {'loss': 0.2561, 'grad_norm': 0.32112274253173473, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.81} |
|
|
0: {'loss': 0.2534, 'grad_norm': 0.3283578387479428, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.81} |
|
|
0:
81%|ββββββββ | 1310/1621 [36:00<08:01, 1.55s/it]
81%|ββββββββ | 1310/1621 [36:00<08:01, 1.55s/it]
81%|ββββββββ | 1311/1621 [36:01<07:50, 1.52s/it]
81%|ββββββββ | 1312/1621 [36:03<07:39, 1.49s/it]
81%|ββββββββ | 1313/1621 [36:04<08:01, 1.56s/it]
81%|ββββββββ | 1314/1621 [36:06<08:18, 1.62s/it]
81%|ββββββββ | 1315/1621 [36:08<07:59, 1.57s/it]
81%|ββββββββ | 1316/1621 [36:09<07:47, 1.53s/it]
81%|ββββββββ | 1317/1621 [36:11<07:46, 1.54s/it]
81%|βββββββββ | 1318/1621 [36:12<07:35, 1.50s/it]
81%|βββββββββ | 1319/1621 [36:13<07:28, 1.49s/it]
81%|βββββββββ | 1320/1621 [36:15<07:20, 1.46s/it]
81%|βββββββββ | 1320/1621 [36:15<07:20, 1.46s/it]
 81%|βββββββββ | 1321/1621 [36:16<07:15, 1.45s/it] |
|
|
0: {'loss': 0.2488, 'grad_norm': 0.30823559386484073, 'learning_rate': 4.9921089333113855e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.82} |
|
|
0:
82%|βββββββββ | 1322/1621 [36:18<07:11, 1.44s/it]
82%|βββββββββ | 1323/1621 [36:19<07:12, 1.45s/it]
82%|βββββββββ | 1324/1621 [36:21<07:06, 1.44s/it]
82%|βββββββββ | 1325/1621 [36:22<07:13, 1.46s/it]
82%|βββββββββ | 1326/1621 [36:23<07:07, 1.45s/it]
82%|βββββββββ | 1327/1621 [36:25<07:01, 1.43s/it]
82%|βββββββββ | 1328/1621 [36:27<07:29, 1.53s/it]
82%|βββββββββ | 1329/1621 [36:28<07:22, 1.52s/it]
82%|βββββββββ | 1330/1621 [36:30<07:13, 1.49s/it]
82%|βββββββββ | 1330/1621 [36:30<07:13, 1.49s/it]
82%|βββββββββ | 1331/1621 [36:31<07:07, 1.47s/it]
82%|βββββββββ | 1332/1621 [36:32<07:03, 1.46s/it]
82%|βββββββββ | 1333/1621 [36:34<06:56, 1.45s/it]
 82%|βββββββββ | 1334/1621 [36:35<06:53, 1.44s/it] |
|
|
0: {'loss': 0.2566, 'grad_norm': 0.30367199165447895, 'learning_rate': 4.96014631413955e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.83} |
|
|
0:
82%|βββββββββ | 1335/1621 [36:37<06:52, 1.44s/it]
82%|βββββββββ | 1336/1621 [36:38<06:47, 1.43s/it]
82%|βββββββββ | 1337/1621 [36:39<06:45, 1.43s/it]
83%|βββββββββ | 1338/1621 [36:41<06:43, 1.43s/it]
83%|βββββββββ | 1339/1621 [36:42<06:40, 1.42s/it]
83%|βββββββββ | 1340/1621 [36:44<06:38, 1.42s/it]
83%|βββββββββ | 1340/1621 [36:44<06:38, 1.42s/it]
83%|βββββββββ | 1341/1621 [36:45<06:48, 1.46s/it]
83%|βββββββββ | 1342/1621 [36:47<06:43, 1.45s/it]
83%|βββββββββ | 1343/1621 [36:48<06:40, 1.44s/it]
83%|βββββββββ | 1344/1621 [36:50<06:36, 1.43s/it]
83%|βββββββββ | 1345/1621 [36:51<06:40, 1.45s/it]
83%|βββββββββ | 1346/1621 [36:53<06:42, 1.46s/it]
 83%|βββββββββ | 1347/1621 [36:54<06:53, 1.51s/it] |
|
|
0: {'loss': 0.2512, 'grad_norm': 0.2884465188753474, 'learning_rate': 4.903968869447152e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.83} |
|
|
0:
83%|βββββββββ | 1348/1621 [36:56<06:48, 1.50s/it]
83%|βββββββββ | 1349/1621 [36:57<06:51, 1.51s/it]
83%|βββββββββ | 1350/1621 [36:59<06:45, 1.49s/it]
83%|βββββββββ | 1350/1621 [36:59<06:45, 1.49s/it]
83%|βββββββββ | 1351/1621 [37:00<06:37, 1.47s/it]
83%|βββββββββ | 1352/1621 [37:01<06:32, 1.46s/it]
83%|βββββββββ | 1353/1621 [37:03<06:27, 1.44s/it]
84%|βββββββββ | 1354/1621 [37:04<06:26, 1.45s/it]
84%|βββββββββ | 1355/1621 [37:06<06:52, 1.55s/it]
84%|βββββββββ | 1356/1621 [37:08<06:39, 1.51s/it]
84%|βββββββββ | 1357/1621 [37:09<07:04, 1.61s/it]
84%|βββββββββ | 1358/1621 [37:11<06:48, 1.55s/it]
 84%|βββββββββ | 1359/1621 [37:12<06:40, 1.53s/it] |
|
|
0: {'loss': 0.2567, 'grad_norm': 0.32993917642338455, 'learning_rate': 4.824192091074126e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.84} |
|
|
0: {'loss': 0.2595, 'grad_norm': 0.29082902932313515, 'learning_rate': 4.721690030098693e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.85} |
|
|
0:
84%|βββββββββ | 1360/1621 [37:14<06:31, 1.50s/it]
84%|βββββββββ | 1360/1621 [37:14<06:31, 1.50s/it]
84%|βββββββββ | 1361/1621 [37:15<06:25, 1.48s/it]
84%|βββββββββ | 1362/1621 [37:17<06:20, 1.47s/it]
84%|βββββββββ | 1363/1621 [37:18<06:18, 1.47s/it]
84%|βββββββββ | 1364/1621 [37:19<06:14, 1.46s/it]
84%|βββββββββ | 1365/1621 [37:21<06:28, 1.52s/it]
84%|βββββββββ | 1366/1621 [37:23<06:17, 1.48s/it]
84%|βββββββββ | 1367/1621 [37:24<06:20, 1.50s/it]
84%|βββββββββ | 1368/1621 [37:26<06:17, 1.49s/it]
84%|βββββββββ | 1369/1621 [37:27<06:34, 1.56s/it]
85%|βββββββββ | 1370/1621 [37:29<06:25, 1.54s/it]
85%|βββββββββ | 1370/1621 [37:29<06:25, 1.54s/it]
 85%|βββββββββ | 1371/1621 [37:30<06:17, 1.51s/it] |
|
|
0: {'loss': 0.2575, 'grad_norm': 0.30422469733818247, 'learning_rate': 4.5975857205508345e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.85} |
|
|
0:
85%|βββββββββ | 1372/1621 [37:32<06:13, 1.50s/it]
85%|βββββββββ | 1373/1621 [37:33<06:14, 1.51s/it]
85%|βββββββββ | 1374/1621 [37:35<06:07, 1.49s/it]
85%|βββββββββ | 1375/1621 [37:37<06:55, 1.69s/it]
85%|βββββββββ | 1376/1621 [37:38<06:37, 1.62s/it]
85%|βββββββββ | 1377/1621 [37:40<06:24, 1.58s/it]
85%|βββββββββ | 1378/1621 [37:41<06:11, 1.53s/it]
85%|βββββββββ | 1379/1621 [37:43<06:05, 1.51s/it]
85%|βββββββββ | 1380/1621 [37:44<05:57, 1.48s/it]
85%|βββββββββ | 1380/1621 [37:44<05:57, 1.48s/it]
85%|βββββββββ | 1381/1621 [37:45<05:53, 1.47s/it]
85%|βββββββββ | 1382/1621 [37:47<05:48, 1.46s/it]
 85%|βββββββββ | 1383/1621 [37:48<05:43, 1.44s/it] |
|
|
0: {'loss': 0.2561, 'grad_norm': 0.319920303484507, 'learning_rate': 4.453238875216452e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.86} |
|
|
0:
85%|βββββββββ | 1384/1621 [37:50<05:39, 1.43s/it]
85%|βββββββββ | 1385/1621 [37:51<05:39, 1.44s/it]
86%|βββββββββ | 1386/1621 [37:53<05:47, 1.48s/it]
86%|βββββββββ | 1387/1621 [37:54<05:55, 1.52s/it]
86%|βββββββββ | 1388/1621 [37:56<06:00, 1.55s/it]
86%|βββββββββ | 1389/1621 [37:57<05:49, 1.51s/it]
86%|βββββββββ | 1390/1621 [37:59<05:43, 1.49s/it]
86%|βββββββββ | 1390/1621 [37:59<05:43, 1.49s/it]
86%|βββββββββ | 1391/1621 [38:00<05:43, 1.50s/it]
86%|βββββββββ | 1392/1621 [38:02<05:37, 1.48s/it]
86%|βββββββββ | 1393/1621 [38:03<05:35, 1.47s/it]
86%|βββββββββ | 1394/1621 [38:05<05:45, 1.52s/it]
86%|βββββββββ | 1395/1621 [38:07<05:57, 1.58s/it]
 86%|βββββββββ | 1396/1621 [38:08<05:48, 1.55s/it] |
|
|
0: {'loss': 0.2553, 'grad_norm': 0.3046012760094752, 'learning_rate': 4.29023098833955e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.86} |
|
|
0:
86%|βββββββββ | 1397/1621 [38:09<05:37, 1.51s/it]
86%|βββββββββ | 1398/1621 [38:11<05:40, 1.53s/it]
86%|βββββββββ | 1399/1621 [38:13<05:39, 1.53s/it]
86%|βββββββββ | 1400/1621 [38:14<05:32, 1.50s/it]
86%|βββββββββ | 1400/1621 [38:14<05:32, 1.50s/it]
86%|βββββββββ | 1401/1621 [38:15<05:25, 1.48s/it]
86%|βββββββββ | 1402/1621 [38:17<05:35, 1.53s/it]
87%|βββββββββ | 1403/1621 [38:19<05:26, 1.50s/it]
87%|βββββββββ | 1404/1621 [38:20<05:20, 1.48s/it]
87%|βββββββββ | 1405/1621 [38:21<05:15, 1.46s/it]
87%|βββββββββ | 1406/1621 [38:23<05:11, 1.45s/it]
87%|βββββββββ | 1407/1621 [38:24<05:07, 1.44s/it]
87%|βββββββββ | 1408/1621 [38:26<05:10, 1.46s/it]
 87%|βββββββββ | 1409/1621 [38:27<05:07, 1.45s/it] |
|
|
0: {'loss': 0.2563, 'grad_norm': 0.30239637984353285, 'learning_rate': 4.110348008440344e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.87} |
|
|
0: {'loss': 0.2615, 'grad_norm': 0.30827484115318715, 'learning_rate': 3.915560771089544e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.88} |
|
|
0:
87%|βββββββββ | 1410/1621 [38:29<05:06, 1.45s/it]
87%|βββββββββ | 1411/1621 [38:30<05:07, 1.47s/it]
87%|βββββββββ | 1412/1621 [38:32<05:03, 1.45s/it]
87%|βββββββββ | 1413/1621 [38:33<05:00, 1.44s/it]
87%|βββββββββ | 1414/1621 [38:35<05:11, 1.51s/it]
87%|βββββββββ | 1415/1621 [38:36<05:15, 1.53s/it]
87%|βββββββββ | 1416/1621 [38:38<05:06, 1.49s/it]
87%|βββββββββ | 1417/1621 [38:39<05:02, 1.48s/it]
87%|βββββββββ | 1418/1621 [38:40<04:57, 1.46s/it]
88%|βββββββββ | 1419/1621 [38:42<04:54, 1.46s/it]
88%|βββββββββ | 1420/1621 [38:43<04:50, 1.45s/it]
88%|βββββββββ | 1420/1621 [38:43<04:5 |
|
|
0: {'loss': 0.2523, 'grad_norm': 0.2940110425585398, 'learning_rate': 3.7080034060214136e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.88} |
|
|
0: 0, 1.45s/it]
88%|βββββββββ | 1421/1621 [38:45<05:00, 1.50s/it]
88%|βββββββββ | 1422/1621 [38:46<04:53, 1.48s/it]
88%|βββββββββ | 1423/1621 [38:48<04:48, 1.46s/it]
88%|βββββββββ | 1424/1621 [38:49<04:45, 1.45s/it]
88%|βββββββββ | 1425/1621 [38:51<04:42, 1.44s/it]
88%|βββββββββ | 1426/1621 [38:52<04:39, 1.43s/it]
88%|βββββββββ | 1427/1621 [38:54<04:41, 1.45s/it]
88%|βββββββββ | 1428/1621 [38:55<04:39, 1.45s/it]
88%|βββββββββ | 1429/1621 [38:56<04:36, 1.44s/it]
88%|βββββββββ | 1430/1621 [38:58<04:43, 1.49s/it]
88%|βββββββββ | 1431/1621 [39:00<04:41, 1.48s/it]
88%|βββββββββ | 1432/1621 [39:01<04:36, 1.46s/it]
88%|βββββββββ | 1433/1621 [3 |
|
|
0: {'loss': 0.2507, 'grad_norm': 0.3055514932413729, 'learning_rate': 3.489949955161813e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.89} |
|
|
0: 9:02<04:32, 1.45s/it]
88%|βββββββββ | 1434/1621 [39:04<04:30, 1.45s/it]
89%|βββββββββ | 1435/1621 [39:05<04:34, 1.47s/it]
89%|βββββββββ | 1436/1621 [39:07<04:30, 1.46s/it]
89%|βββββββββ | 1437/1621 [39:08<04:28, 1.46s/it]
89%|βββββββββ | 1438/1621 [39:10<04:28, 1.46s/it]
89%|βββββββββ | 1439/1621 [39:11<04:23, 1.45s/it]
89%|βββββββββ | 1440/1621 [39:13<04:39, 1.54s/it]
89%|βββββββββ | 1441/1621 [39:14<04:32, 1.51s/it]
89%|βββββββββ | 1442/1621 [39:16<04:26, 1.49s/it]
89%|βββββββββ | 1443/1621 [39:17<04:21, 1.47s/it]
89%|βββββββββ | 1444/1621 [39:19<04:17, 1.45s/it]
89%|βββββββββ | 1445/1621 [39:20<04:14, 1.45s/it]
89%|βββββββββ | 144 |
|
|
0: {'loss': 0.2592, 'grad_norm': 0.308632847742623, 'learning_rate': 3.263789457748976e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.89} |
|
|
0: 6/1621 [39:21<04:12, 1.44s/it]
89%|βββββββββ | 1447/1621 [39:23<04:12, 1.45s/it]
89%|βββββββββ | 1448/1621 [39:24<04:10, 1.45s/it]
89%|βββββββββ | 1449/1621 [39:26<04:11, 1.46s/it]
89%|βββββββββ | 1450/1621 [39:27<04:14, 1.49s/it]
90%|βββββββββ | 1451/1621 [39:29<04:09, 1.47s/it]
90%|βββββββββ | 1452/1621 [39:30<04:06, 1.46s/it]
90%|βββββββββ | 1453/1621 [39:32<04:14, 1.52s/it]
90%|βββββββββ | 1454/1621 [39:34<04:20, 1.56s/it]
90%|βββββββββ | 1455/1621 [39:35<04:12, 1.52s/it]
90%|βββββββββ | 1456/1621 [39:36<04:06, 1.49s/it]
90%|βββββββββ | 1457/1621 [39:38<04:02, 1.48s/it]
90%|βββββββββ | 1458/1621 [39:39<04:03, 1.49s/it]
90%|ββββββββ |
|
|
0: {'loss': 0.2494, 'grad_norm': 0.2790640815194346, 'learning_rate': 3.031999775519685e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.9} |
|
|
0: {'loss': 0.2612, 'grad_norm': 0.32053738639864715, 'learning_rate': 2.7971204447375534e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.91} |
|
|
0: β | 1459/1621 [39:41<03:57, 1.47s/it]
90%|βββββββββ | 1460/1621 [39:42<03:54, 1.46s/it]
90%|βββββββββ | 1461/1621 [39:44<03:52, 1.45s/it]
90%|βββββββββ | 1462/1621 [39:45<03:50, 1.45s/it]
90%|βββββββββ | 1463/1621 [39:47<03:47, 1.44s/it]
90%|βββββββββ | 1464/1621 [39:48<03:54, 1.49s/it]
90%|βββββββββ | 1465/1621 [39:50<04:03, 1.56s/it]
90%|βββββββββ | 1466/1621 [39:52<04:06, 1.59s/it]
90%|βββββββββ | 1467/1621 [39:53<03:56, 1.54s/it]
91%|βββββββββ | 1468/1621 [39:55<03:57, 1.55s/it]
91%|βββββββββ | 1469/1621 [39:56<03:50, 1.51s/it]
91%|βββββββββ | 1470/1621 [39:57<03:44, 1.48s/it]
91%|βββββββββ | 14 |
|
|
0: {'loss': 0.2583, 'grad_norm': 0.30047512075465854, 'learning_rate': 2.561724852502291e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.91} |
|
|
0: 70/1621 [39:57<03:44, 1.48s/it]
91%|βββββββββ | 1471/1621 [39:59<03:49, 1.53s/it]
91%|βββββββββ | 1472/1621 [40:00<03:45, 1.51s/it]
91%|βββββββββ | 1473/1621 [40:02<03:40, 1.49s/it]
91%|βββββββββ | 1474/1621 [40:03<03:35, 1.47s/it]
91%|βββββββββ | 1475/1621 [40:05<03:31, 1.45s/it]
91%|βββββββββ | 1476/1621 [40:06<03:30, 1.45s/it]
91%|βββββββββ | 1477/1621 [40:08<03:28, 1.45s/it]
91%|βββββββββ | 1478/1621 [40:09<03:25, 1.44s/it]
91%|βββββββββ | 1479/1621 [40:10<03:23, 1.43s/it]
91%|ββββββββββ| 1480/1621 [40:12<03:21, 1.43s/it]
91%|ββββββββββ| 1481/1621 [40:13<03:23, 1.45s/it]
91%|ββββββββββ| 1482/1621 [40:15<03:22, 1.45s/it]
91%|βββββ |
|
|
0: {'loss': 0.2514, 'grad_norm': 0.2898197692202884, 'learning_rate': 2.3283920421821194e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.92} |
|
|
0: βββββ| 1483/1621 [40:16<03:19, 1.44s/it]
92%|ββββββββββ| 1484/1621 [40:18<03:19, 1.46s/it]
92%|ββββββββββ| 1485/1621 [40:19<03:19, 1.47s/it]
92%|ββββββββββ| 1486/1621 [40:21<03:16, 1.45s/it]
92%|ββββββββββ| 1487/1621 [40:22<03:27, 1.55s/it]
92%|ββββββββββ| 1488/1621 [40:24<03:21, 1.51s/it]
92%|ββββββββββ| 1489/1621 [40:25<03:15, 1.48s/it]
92%|ββββββββββ| 1490/1621 [40:27<03:15, 1.49s/it]
92%|ββββββββββ| 1491/1621 [40:28<03:11, 1.47s/it]
92%|ββββββββββ| 1492/1621 [40:30<03:08, 1.46s/it]
92%|ββββββββββ| 1493/1621 [40:31<03:06, 1.46s/it]
92%|ββββββββββ| 1494/1621 [40:33<03:03, 1.44s/it]
92%|ββββββββββ| 1495/1621 [40:34<0 |
|
|
0: {'loss': 0.2493, 'grad_norm': 0.2788921768117285, 'learning_rate': 2.099678456874939e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.93} |
|
|
0: 3:07, 1.49s/it]
92%|ββββββββββ| 1496/1621 [40:36<03:03, 1.47s/it]
92%|ββββββββββ| 1497/1621 [40:37<03:02, 1.47s/it]
92%|ββββββββββ| 1498/1621 [40:38<02:59, 1.46s/it]
92%|ββββββββββ| 1499/1621 [40:40<02:56, 1.44s/it]
93%|ββββββββββ| 1500/1621 [40:41<02:54, 1.44s/it]
93%|ββββββββββ| 1501/1621 [40:43<02:52, 1.44s/it]
93%|ββββββββββ| 1502/1621 [40:44<02:50, 1.43s/it]
93%|ββββββββββ| 1503/1621 [40:46<02:48, 1.43s/it]
93%|ββββββββββ| 1504/1621 [40:47<02:52, 1.47s/it]
93%|ββββββββββ| 1505/1621 [40:49<02:52, 1.48s/it]
93%|ββββββββββ| 1506/1621 [40:50<02:48, 1.47s/it]
93%|ββββββββββ| 1507/1621 [40:51<02:45, 1.45s/it]
93%|ββββ |
|
|
0: {'loss': 0.2549, 'grad_norm': 0.292809089573132, 'learning_rate': 1.8780899304827687e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.93} |
|
|
0: βββββ| 1508/1621 [40:53<02:42, 1.44s/it]
93%|ββββββββββ| 1509/1621 [40:55<02:49, 1.51s/it]
93%|ββββββββββ| 1510/1621 [40:56<02:45, 1.49s/it]
93%|ββββββββββ| 1511/1621 [40:57<02:43, 1.48s/it]
93%|ββββββββββ| 1512/1621 [40:59<02:41, 1.48s/it]
93%|ββββββββββ| 1513/1621 [41:00<02:37, 1.46s/it]
93%|ββββββββββ| 1514/1621 [41:02<02:35, 1.45s/it]
93%|ββββββββββ| 1515/1621 [41:03<02:35, 1.47s/it]
94%|ββββββββββ| 1516/1621 [41:05<02:33, 1.46s/it]
94%|ββββββββββ| 1517/1621 [41:06<02:31, 1.45s/it]
94%|ββββββββββ| 1518/1621 [41:08<02:28, 1.44s/it]
94%|ββββββββββ| 1519/1621 [41:09<02:26, 1.44s/it]
94%|ββββββββββ| 1520/1621 [41:11 |
|
|
0: {'loss': 0.2564, 'grad_norm': 0.3019723689156526, 'learning_rate': 1.6660542332711405e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.94} |
|
|
0: {'loss': 0.2615, 'grad_norm': 0.28458470847453754, 'learning_rate': 1.465894472710029e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.94} |
|
|
0: <02:26, 1.45s/it]
94%|ββββββββββ| 1520/1621 [41:11<02:26, 1.45s/it]
94%|ββββββββββ| 1521/1621 [41:12<02:26, 1.47s/it]
94%|ββββββββββ| 1522/1621 [41:14<02:37, 1.59s/it]
94%|ββββββββββ| 1523/1621 [41:15<02:31, 1.55s/it]
94%|ββββββββββ| 1524/1621 [41:17<02:29, 1.54s/it]
94%|ββββββββββ| 1525/1621 [41:18<02:24, 1.50s/it]
94%|ββββββββββ| 1526/1621 [41:20<02:20, 1.47s/it]
94%|ββββββββββ| 1527/1621 [41:21<02:17, 1.46s/it]
94%|ββββββββββ| 1528/1621 [41:23<02:14, 1.44s/it]
94%|ββββββββββ| 1529/1621 [41:24<02:12, 1.44s/it]
94%|ββββββββββ| 1530/1621 [41:25<02:10, 1.44s/it]
94%|ββββββββββ |
|
|
0: {'loss': 0.2471, 'grad_norm': 0.28628626975937854, 'learning_rate': 1.2798036410222628e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.95} |
|
|
0: | 1531/1621 [41:27<02:11, 1.46s/it]
95%|ββββββββββ| 1532/1621 [41:28<02:08, 1.45s/it]
95%|ββββββββββ| 1533/1621 [41:30<02:10, 1.49s/it]
95%|ββββββββββ| 1534/1621 [41:31<02:08, 1.47s/it]
95%|ββββββββββ| 1535/1621 [41:33<02:10, 1.51s/it]
95%|ββββββββββ| 1536/1621 [41:34<02:06, 1.48s/it]
95%|ββββββββββ| 1537/1621 [41:36<02:13, 1.59s/it]
95%|ββββββββββ| 1538/1621 [41:38<02:14, 1.62s/it]
95%|ββββββββββ| 1539/1621 [41:39<02:08, 1.56s/it]
95%|ββββββββββ| 1540/1621 [41:41<02:04, 1.53s/it]
95%|ββββββββββ| 1541/1621 [41:42<02:00, 1.51s/it]
95%|ββββββββββ| 1542/1621 [41:44<01:56, 1.48s/it]
95%|ββββββββββ| 1543/1621 [41:45<01:53, 1.46s/it |
|
|
0: {'loss': 0.2498, 'grad_norm': 0.2849557358168061, 'learning_rate': 1.1098205883018246e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.96} |
|
|
0: ]
95%|ββββββββββ| 1544/1621 [41:46<01:51, 1.44s/it]
95%|ββββββββββ| 1545/1621 [41:48<01:49, 1.44s/it]
95%|ββββββββββ| 1546/1621 [41:49<01:47, 1.44s/it]
95%|ββββββββββ| 1547/1621 [41:51<01:46, 1.43s/it]
95%|ββββββββββ| 1548/1621 [41:52<01:44, 1.43s/it]
96%|ββββββββββ| 1549/1621 [41:54<01:42, 1.43s/it]
96%|ββββββββββ| 1550/1621 [41:55<01:42, 1.44s/it]
96%|ββββββββββ| 1551/1621 [41:57<01:40, 1.43s/it]
96%|ββββββββββ| 1552/1621 [41:58<01:42, 1.49s/it]
96%|ββββββββββ| 1553/1621 [42:00<01:40, 1.47s/it]
96%|ββββββββββ| 1554/1621 [42:01<01:38, 1.47s/it]
96%|ββββββββββ| 1555/1621 [42:02<01:36, 1.47s/it]
96%|βββββββββ |
|
|
0: {'loss': 0.2488, 'grad_norm': 0.28016585021963597, 'learning_rate': 9.578076844455587e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.96} |
|
|
0: | 1556/1621 [42:04<01:34, 1.46s/it]
96%|ββββββββββ| 1557/1621 [42:05<01:32, 1.44s/it]
96%|ββββββββββ| 1558/1621 [42:07<01:30, 1.44s/it]
96%|ββββββββββ| 1559/1621 [42:08<01:31, 1.48s/it]
96%|ββββββββββ| 1560/1621 [42:10<01:30, 1.48s/it]
96%|ββββββββββ| 1561/1621 [42:11<01:27, 1.46s/it]
96%|ββββββββββ| 1562/1621 [42:13<01:25, 1.44s/it]
96%|ββββββββββ| 1563/1621 [42:14<01:24, 1.45s/it]
96%|ββββββββββ| 1564/1621 [42:16<01:22, 1.45s/it]
97%|ββββββββββ| 1565/1621 [42:17<01:20, 1.44s/it]
97%|ββββββββββ| 1566/1621 [42:18<01:19, 1.44s/it]
97%|ββββββββββ| 1567/1621 [42:20<01:17, 1.43s/it]
97%|ββββββββββ| 1568/1621 [42:21<01:16, 1.44s/ |
|
|
0: {'loss': 0.2597, 'grad_norm': 0.2927629804951191, 'learning_rate': 8.254304146388603e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.97} |
|
|
0: it]
97%|ββββββββββ| 1569/1621 [42:23<01:15, 1.46s/it]
97%|ββββββββββ| 1570/1621 [42:24<01:16, 1.50s/it]
97%|ββββββββββ| 1571/1621 [42:26<01:13, 1.48s/it]
97%|ββββββββββ| 1572/1621 [42:27<01:13, 1.50s/it]
97%|ββββββββββ| 1573/1621 [42:29<01:11, 1.49s/it]
97%|ββββββββββ| 1574/1621 [42:30<01:09, 1.48s/it]
97%|ββββββββββ| 1575/1621 [42:32<01:07, 1.46s/it]
97%|ββββββββββ| 1576/1621 [42:33<01:05, 1.46s/it]
97%|ββββββββββ| 1577/1621 [42:35<01:04, 1.45s/it]
97%|ββββββββββ| 1578/1621 [42:36<01:02, 1.45s/it]
97%|ββββββββββ| 1579/1621 [42:37<01:00, 1.44s/it]
97%|ββββββββββ| 1580/1621 [42:39<00:58, 1.44s/it]
|
|
|
0: {'loss': 0.2517, 'grad_norm': 0.2752181427419336, 'learning_rate': 7.141391319514565e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.97} |
|
|
0: {'loss': 0.2526, 'grad_norm': 0.2740685329096038, 'learning_rate': 6.251531669656679e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.98} |
|
|
0:
97%|ββββββββββ| 1580/1621 [42:39<00:58, 1.44s/it]
98%|ββββββββββ| 1581/1621 [42:40<00:58, 1.46s/it]
98%|ββββββββββ| 1582/1621 [42:42<00:56, 1.44s/it]
98%|ββββββββββ| 1583/1621 [42:43<00:54, 1.44s/it]
98%|ββββββββββ| 1584/1621 [42:45<00:53, 1.44s/it]
98%|ββββββββββ| 1585/1621 [42:46<00:51, 1.44s/it]
98%|ββββββββββ| 1586/1621 [42:48<00:50, 1.43s/it]
98%|ββββββββββ| 1587/1621 [42:49<00:49, 1.44s/it]
98%|ββββββββββ| 1588/1621 [42:51<00:48, 1.48s/it]
98%|ββββββββββ| 1589/1621 [42:52<00:46, 1.46s/it]
98%|ββββββββββ| 1590/1621 [42:53<00:44, 1.45s/it]
98%|ββββββββββ| 1591/1621 [42:55<00:43, 1.45s/it]
98%|ββ |
|
|
0: {'loss': 0.2544, 'grad_norm': 0.29567761780713164, 'learning_rate': 5.594474685353894e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.99} |
|
|
0: βββββββ| 1592/1621 [42:56<00:41, 1.44s/it]
98%|ββββββββββ| 1593/1621 [42:58<00:41, 1.47s/it]
98%|ββββββββββ| 1594/1621 [43:00<00:41, 1.55s/it]
98%|ββββββββββ| 1595/1621 [43:01<00:39, 1.50s/it]
98%|ββββββββββ| 1596/1621 [43:02<00:37, 1.50s/it]
99%|ββββββββββ| 1597/1621 [43:04<00:35, 1.48s/it]
99%|ββββββββββ| 1598/1621 [43:05<00:33, 1.46s/it]
99%|ββββββββββ| 1599/1621 [43:07<00:32, 1.46s/it]
99%|ββββββββββ| 1600/1621 [43:08<00:30, 1.45s/it]
99%|ββββββββββ| 1601/1621 [43:10<00:29, 1.47s/it]
99%|ββββββββββ| 1602/1621 [43:11<00:27, 1.45s/it]
99%|ββββββββββ| 1603/1621 [43:13<00:26, 1.48s/it]
99%|ββββββββββ| 1604/1621 [ |
|
|
0: {'loss': 0.2525, 'grad_norm': 0.28413860743448904, 'learning_rate': 5.177419220424251e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.99} |
|
|
0: 43:14<00:24, 1.47s/it]
99%|ββββββββββ| 1605/1621 [43:16<00:24, 1.52s/it]
99%|ββββββββββ| 1606/1621 [43:17<00:22, 1.50s/it]
99%|ββββββββββ| 1607/1621 [43:19<00:20, 1.50s/it]
99%|ββββββββββ| 1608/1621 [43:20<00:19, 1.47s/it]
99%|ββββββββββ| 1609/1621 [43:22<00:18, 1.51s/it]
99%|ββββββββββ| 1610/1621 [43:23<00:16, 1.49s/it]
99%|ββββββββββ| 1611/1621 [43:25<00:14, 1.47s/it]
99%|ββββββββββ| 1612/1621 [43:26<00:13, 1.46s/it]
100%|ββββββββββ| 1613/1621 [43:27<00:11, 1.47s/it]
100%|ββββββββββ| 1614/1621 [43:29<00:10, 1.45s/it]
100%|ββββββββββ| 1615/1621 [43:30<00:08, 1.45s/it]
100%|ββββββββββ| 1616/1621 [43:32<00:07, 1.44s/it]
100%|ββ |
|
|
0: {'loss': 0.2512, 'grad_norm': 0.26094743319216807, 'learning_rate': 5.004934621815976e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 1.0} |
|
|
0: [2025-09-02 19:31:38,125] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-1621 |
|
|
0: [2025-09-02 19:31:42,994] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None` |
|
|
0: {'train_runtime': 2637.1175, 'train_samples_per_second': 9.835, 'train_steps_per_second': 0.615, 'train_loss': 0.26498376052737016, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 1.0} |
|
|
0: ββββββββ| 1617/1621 [43:33<00:05, 1.44s/it]
100%|ββββββββββ| 1618/1621 [43:35<00:04, 1.44s/it]
100%|ββββββββββ| 1619/1621 [43:36<00:02, 1.44s/it]
100%|ββββββββββ| 1620/1621 [43:38<00:01, 1.45s/it]
100%|ββββββββββ| 1621/1621 [43:48<00:00, 4.02s/it]
100%|ββββββββββ| 1621/1621 [43:57<00:00, 4.02s/it]
100%|ββββββββββ| 1621/1621 [43:57<00:00, 1.63s/it] |
|
|
0: [2025-09-02 19:31:45,684] [INFO] [axolotl.train.save_trained_model:228] [PID:1478787] [RANK:0] Training completed! Saving trained model to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0. |
|
|
0: [2025-09-02 19:31:47,159] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0 |
|
|
0: [2025-09-02 19:31:51,888] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None` |
|
|
0: [2025-09-02 19:31:52,303] [INFO] [axolotl.train.save_trained_model:350] [PID:1478787] [RANK:0] Model successfully saved to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0 |
|
|
|