Text Generation
Transformers
TensorBoard
Safetensors
English
qwen2
Generated from Trainer
conversational
text-generation-inference
File size: 221,172 Bytes
ea8ce34
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] 
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] *****************************************
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2: W0902 18:42:01.546000 3691206 torch/distributed/run.py:792] *****************************************
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] 
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] *****************************************
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
0: W0902 18:42:05.937000 1478709 torch/distributed/run.py:792] *****************************************
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] 
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] *****************************************
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
3: W0902 18:42:05.939000 368050 torch/distributed/run.py:792] *****************************************
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] 
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] *****************************************
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
1: W0902 18:42:05.948000 669827 torch/distributed/run.py:792] *****************************************
2: [2025-09-02 18:42:25,579] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:3691281] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
2: [2025-09-02 18:42:25,580] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:3691281] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
2: [2025-09-02 18:42:29,081] [INFO] [axolotl.utils.data.sft._load_raw_datasets:314] [PID:3691281] [RANK:0] Loading raw datasets...
2: [2025-09-02 18:42:29,322] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:88] [PID:3691281] [RANK:0] Loading dataset: /lustre/fswork/projects/rech/qwv/udv55np/dataset/math/hf/no_thinking_text/generator/default-d32b2cae8ea7e541/0.0.0 with base_type: chat_template and prompt_style: None
0: [2025-09-02 18:42:33,914] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:1478787] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
0: [2025-09-02 18:42:33,914] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:1478787] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
1: [2025-09-02 18:42:33,920] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:669903] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
1: [2025-09-02 18:42:33,920] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:669903] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
3: [2025-09-02 18:42:33,923] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:368126] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
3: [2025-09-02 18:42:33,924] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:368126] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
0: [2025-09-02 18:42:37,930] [INFO] [axolotl.cli.config.load_cfg:245] [PID:1478787] [RANK:0] config:
0: {
0:   "activation_offloading": false,
0:   "auto_resume_from_checkpoints": true,
0:   "axolotl_config_path": "/lustre/fswork/projects/rech/dgo/udv55np/train/tmp/1756826505740523622.yaml",
0:   "base_model": "/lustre/fswork/projects/rech/qwv/udv55np/Qwen/Qwen2.5-3B_ift",
0:   "base_model_config": "/lustre/fswork/projects/rech/qwv/udv55np/Qwen/Qwen2.5-3B_ift",
0:   "batch_size": 16,
0:   "bf16": true,
0:   "capabilities": {
0:     "bf16": true,
0:     "compute_capability": "sm_90",
0:     "fp8": false,
0:     "n_gpu": 16,
0:     "n_node": 1
0:   },
0:   "chat_template": "qwen_25",
0:   "context_parallel_size": 1,
0:   "dataloader_num_workers": 16,
0:   "dataloader_pin_memory": true,
0:   "dataloader_prefetch_factor": 256,
0:   "dataset_prepared_path": "/lustre/fsn1/projects/rech/dgo/udv55np/dataset_math/Qwen3-235B-A22B/0",
0:   "dataset_processes": 192,
0:   "datasets": [
0:     {
0:       "chat_template": "tokenizer_default",
0:       "field_messages": "conversations",
0:       "message_property_mappings": {
0:         "content": "content",
0:         "role": "role"
0:       },
0:       "path": "/lustre/fswork/projects/rech/qwv/udv55np/dataset/math/hf/no_thinking_text/generator/default-d32b2cae8ea7e541/0.0.0",
0:       "trust_remote_code": false,
0:       "type": "chat_template"
0:     }
0:   ],
0:   "ddp": true,
0:   "deepspeed": {
0:     "bf16": {
0:       "enabled": true
0:     },
0:     "gradient_accumulation_steps": "auto",
0:     "gradient_clipping": "auto",
0:     "train_batch_size": "auto",
0:     "train_micro_batch_size_per_gpu": "auto",
0:     "wall_clock_breakdown": false,
0:     "zero_optimization": {
0:       "contiguous_gradients": true,
0:       "overlap_comm": true,
0:       "reduce_bucket_size": "auto",
0:       "stage": 3,
0:       "stage3_gather_16bit_weights_on_model_save": true,
0:       "stage3_param_persistence_threshold": "auto",
0:       "stage3_prefetch_bucket_size": "auto",
0:       "sub_group_size": 0
0:     }
0:   },
0:   "device": "cuda:0",
0:   "device_map": {
0:     "": 0
0:   },
0:   "dion_rank_fraction": 1.0,
0:   "dion_rank_multiple_of": 1,
0:   "env_capabilities": {
0:     "torch_version": "2.6.0"
0:   },
0:   "eval_batch_size": 1,
0:   "eval_causal_lm_metrics": [
0:     "sacrebleu",
0:     "comet",
0:     "ter",
0:     "chrf"
0:   ],
0:   "eval_max_new_tokens": 128,
0:   "eval_sample_packing": true,
0:   "eval_table_size": 0,
0:   "evals_per_epoch": 0,
0:   "flash_attention": true,
0:   "fp16": false,
0:   "gradient_accumulation_steps": 1,
0:   "gradient_checkpointing": true,
0:   "gradient_checkpointing_kwargs": {
0:     "use_reentrant": true
0:   },
0:   "learning_rate": 5e-06,
0:   "lisa_layers_attribute": "model.layers",
0:   "load_best_model_at_end": false,
0:   "load_in_4bit": false,
0:   "load_in_8bit": false,
0:   "local_rank": 0,
0:   "logging_steps": 10,
0:   "lora_dropout": 0.0,
0:   "loraplus_lr_embedding": 1e-06,
0:   "lr_scheduler": "warmup_stable_decay",
0:   "lr_scheduler_kwargs": {
0:     "min_lr_ratio": 0.1,
0:     "num_decay_steps": 300
0:   },
0:   "max_prompt_len": 512,
0:   "mean_resizing_embeddings": false,
0:   "micro_batch_size": 1,
0:   "model_config_type": "qwen2",
0:   "num_epochs": 1.0,
0:   "optimizer": "adamw_torch_fused",
0:   "output_dir": "/lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0",
0:   "pad_to_sequence_len": true,
0:   "pretrain_multipack_attn": true,
0:   "pretrain_multipack_buffer_size": 10000,
0:   "profiler_steps_start": 0,
0:   "qlora_sharded_model_loading": false,
0:   "ray_num_workers": 1,
0:   "resources_per_worker": {
0:     "GPU": 1
0:   },
0:   "sample_packing": true,
0:   "sample_packing_bin_size": 200,
0:   "sample_packing_group_size": 100000,
0:   "save_only_model": false,
0:   "save_safetensors": true,
0:   "save_steps": 0.2,
0:   "save_total_limit": 20,
0:   "sequence_len": 16384,
0:   "shuffle_before_merging_datasets": false,
0:   "shuffle_merged_datasets": true,
0:   "skip_prepare_dataset": false,
0:   "special_tokens": {
0:     "bos_token": "<|im_start|>",
0:     "eos_token": "<|im_end|>",
0:     "pad_token": "<|endoftext|>"
0:   },
0:   "strict": false,
0:   "tensor_parallel_size": 1,
0:   "tf32": false,
0:   "tiled_mlp_use_original_mlp": true,
0:   "tokenizer_config": "/lustre/fswork/projects/rech/qwv/udv55np/Qwen/Qwen2.5-3B_ift",
0:   "torch_dtype": "torch.bfloat16",
0:   "train_on_inputs": false,
0:   "trl": {
0:     "log_completions": false,
0:     "mask_truncated_completions": false,
0:     "ref_model_mixup_alpha": 0.9,
0:     "ref_model_sync_steps": 64,
0:     "scale_rewards": true,
0:     "sync_ref_model": false,
0:     "use_vllm": false,
0:     "vllm_server_host": "0.0.0.0",
0:     "vllm_server_port": 8000
0:   },
0:   "use_ray": false,
0:   "use_tensorboard": true,
0:   "val_set_size": 0.0,
0:   "vllm": {
0:     "device": "auto",
0:     "dtype": "auto",
0:     "gpu_memory_utilization": 0.9,
0:     "host": "0.0.0.0",
0:     "port": 8000
0:   },
0:   "warmup_steps": 150,
0:   "weight_decay": 0.0,
0:   "world_size": 16
0: }
0: [2025-09-02 18:42:37,931] [INFO] [axolotl.cli.checks.check_user_token:35] [PID:1478787] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
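As a reading aid for the config dump above, here is a minimal sketch of how the reported `batch_size` relates to the other values. It assumes the effective batch size is the product of `micro_batch_size`, `gradient_accumulation_steps` and `world_size`. That relationship is an assumption that happens to be consistent with the numbers logged here, not something the log itself states. The variable `tokens_per_step` is a hypothetical name for an upper bound, given that `sample_packing` and `pad_to_sequence_len` are both true.

```python
# Values copied from the config dump above.
micro_batch_size = 1
gradient_accumulation_steps = 1
world_size = 16
sequence_len = 16384

# Assumed relationship: per-GPU micro batch x accumulation steps x number of ranks.
effective_batch_size = micro_batch_size * gradient_accumulation_steps * world_size

# With packed, padded sequences, each sample holds at most sequence_len tokens,
# so this is an upper bound on tokens seen per optimizer step.
tokens_per_step = effective_batch_size * sequence_len

print(effective_batch_size)  # 16, matching "batch_size": 16 in the dump
print(tokens_per_step)       # 262144
```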
2: Tokenizing Prompts (num_proc=192): 100%|██████████| 321773/321773 [00:38<00:00, 8276.39 examples/s]
2: Dropping Long Sequences (>16384) (num_proc=192): 100%|██████████| 321773/321773 [00:04<00:00, 67525.21 examples/s]
2: Drop Samples with Zero Trainable Tokens (num_proc=192): 100%|██████████| 315947/315947 [00:05<00:00, 59605.93 examples/s]
2: 
2: Add position_id column (Sample Packing) (num_proc=192): 100%|██████████| 315947/315947 [00:05<00:00, 54860.47 examples/s]
2: Saving the dataset (192/192 shards): 100%|██████████| 315947/315947 [00:01<00:00, 169075.72 examples/s]
0: [2025-09-02 18:43:32,409] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:472] [PID:1478787] [RANK:0] Loading prepared dataset from disk at /lustre/fsn1/projects/rech/dgo/udv55np/dataset_math/Qwen3-235B-A22B/0/b1771c7e92212c2fb90b5a0bac7a225c...
0: [2025-09-02 18:45:12,633] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:1478787] [RANK:0] gather_len_batches: [25939, 25939, 25938, 25939, 25939, 25938, 25940, 25938, 25939, 25939, 25939, 25937, 25939, 25940, 25939, 25938]
0: [2025-09-02 18:45:12,660] [INFO] [axolotl.utils.trainer.calc_sample_packing_eff_est:495] [PID:1478787] [RANK:0] sample_packing_eff_est across ranks: [0.9965550303459167, 0.9965550303459167, 0.9964781999588013, 0.9965550303459167, 0.9965166449546814, 0.9965934753417969, 0.9965550303459167, 0.9965166449546814, 0.9965934753417969, 0.9965550303459167, 0.9965934753417969, 0.9965934753417969, 0.9965550303459167, 0.9965550303459167, 0.9965934753417969, 0.9965934753417969]
0: [2025-09-02 18:45:12,665] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:127] [PID:1478787] [RANK:0] Maximum number of steps set at 1621
0: [2025-09-02 18:45:12,990] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:110] [PID:1478787] [RANK:0] Patched Trainer.evaluation_loop with nanmean loss calculation
0: [2025-09-02 18:45:12,991] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:164] [PID:1478787] [RANK:0] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
0: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
0: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
3: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.64s/it]
1: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
2: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.65s/it]
0: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
3: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.64s/it]
1: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
1: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
2: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.65s/it]
3: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.64s/it]
2: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.65s/it]
1: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
3: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.63s/it]
2: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.65s/it]
0: Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.67s/it]
0: [2025-09-02 18:45:30,643] [INFO] [axolotl.loaders.model._configure_embedding_dtypes:345] [PID:1478787] [RANK:0] Converting modules to torch.bfloat16
0: [2025-09-02 18:45:39,599] [INFO] [axolotl.train.save_initial_configs:416] [PID:1478787] [RANK:0] Pre-saving tokenizer to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0...
0: [2025-09-02 18:45:39,769] [INFO] [axolotl.train.save_initial_configs:419] [PID:1478787] [RANK:0] Pre-saving model config to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0...
0: [2025-09-02 18:45:39,777] [INFO] [axolotl.train.execute_training:203] [PID:1478787] [RANK:0] Starting trainer...
0: [2025-09-02 18:47:46,046] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:1478787] [RANK:0] gather_len_batches: [25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939, 25939]
0: Parameter Offload - Persistent parameters statistics: param_count = 181, numel = 241664
0: {'loss': 0.293, 'grad_norm': 0.3438611558851885, 'learning_rate': 7.7e-07, 'memory/max_mem_active(gib)': 35.16, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 41.77, 'epoch': 0.01}
0: 
  0%|          | 0/1621 [00:00<?, ?it/s]
  0%|          | 1/1621 [03:16<88:16:18, 196.16s/it]
  0%|          | 2/1621 [03:18<37:02:55, 82.38s/it] 
  0%|          | 3/1621 [03:20<20:24:21, 45.40s/it]
  0%|          | 4/1621 [03:21<12:35:47, 28.04s/it]
  0%|          | 5/1621 [03:23<8:16:55, 18.45s/it] 
  0%|          | 6/1621 [03:24<5:40:48, 12.66s/it]
  0%|          | 7/1621 [03:26<4:01:45,  8.99s/it]
  0%|          | 8/1621 [03:27<2:58:05,  6.62s/it]
  1%|          | 9/1621 [03:29<2:14:25,  5.00s/it]
  1%|          | 10/1621 [03:30<1:45:26,  3.93s/it]
  1%|          | 11/1621 [03:31<1:25:08,  3.17s/it]
  1%|          | 12/1621 [03:33<1:10:52,  2.64s/it]
  1%|          | 13/1621 [03:34<1:01:40,  2.30s/it]
  1%|          | 14/1621 [03:36<54:29,  2.03s/it]  
  1%|          | 15/1621 [03:37<49:26,  1.85s/it]
  1%|          | 16/1621 [03:39<46:23,  1.73s/it]
  1%|          | 17/1621 [03:40<43:46,  1.64s/it]
  1%|          | 18/1621 [03:42<42:43,  1.60s/it]
0: {'loss': 0.2902, 'grad_norm': 0.32969781245683594, 'learning_rate': 1.0700000000000001e-06, 'memory/max_mem_active(gib)': 35.16, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 41.77, 'epoch': 0.01}
0: {'loss': 0.2916, 'grad_norm': 0.2957381320197924, 'learning_rate': 1.3700000000000002e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.02}
  1%|          | 19/1621 [03:43<42:48,  1.60s/it]
  1%|          | 20/1621 [03:45<41:24,  1.55s/it]
  1%|▏         | 21/1621 [03:47<43:38,  1.64s/it]
  1%|▏         | 22/1621 [03:48<41:53,  1.57s/it]
  1%|▏         | 23/1621 [03:49<40:35,  1.52s/it]
  1%|▏         | 24/1621 [03:51<39:54,  1.50s/it]
  2%|▏         | 25/1621 [03:52<39:20,  1.48s/it]
  2%|▏         | 26/1621 [03:54<38:55,  1.46s/it]
  2%|▏         | 27/1621 [03:55<38:35,  1.45s/it]
  2%|▏         | 28/1621 [03:57<40:41,  1.53s/it]
  2%|▏         | 29/1621 [03:58<39:43,  1.50s/it]
  2%|▏         | 30/1621 [04:00<40:17,  1.52s/it]
  2%|▏         | 31/1621 [04:01<40:17,  1.52s/it]
  2%|▏         | 32/1621 [04:03<39:46,  1.50s/it]
  2%|▏         | 33/1621 [04:04<39:13,  1.48s/it]
0: {'loss': 0.2879, 'grad_norm': 0.7741302157876179, 'learning_rate': 1.6700000000000003e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.02}
  2%|▏         | 34/1621 [04:06<39:51,  1.51s/it]
  2%|▏         | 35/1621 [04:07<39:09,  1.48s/it]
  2%|▏         | 36/1621 [04:09<38:39,  1.46s/it]
  2%|▏         | 37/1621 [04:10<38:28,  1.46s/it]
  2%|▏         | 38/1621 [04:12<38:11,  1.45s/it]
  2%|▏         | 39/1621 [04:13<38:00,  1.44s/it]
  2%|▏         | 40/1621 [04:14<38:20,  1.45s/it]
  3%|▎         | 41/1621 [04:16<38:05,  1.45s/it]
  3%|▎         | 42/1621 [04:17<38:07,  1.45s/it]
  3%|▎         | 43/1621 [04:19<38:02,  1.45s/it]
  3%|▎         | 44/1621 [04:20<38:23,  1.46s/it]
  3%|▎         | 45/1621 [04:22<40:59,  1.56s/it]
  3%|▎         | 46/1621 [04:24<40:34,  1.55s/it]
  3%|▎         | 47/1621 [04:25<40:17,  1.54s/it]
  3%|▎         | 48/1621 [04:26<39:20,  1.50s/it]
  3%|▎         | 49/1621 [04:28<38:39,  1.48s/it]
  3%|▎         | 50/1621 [04:29<38:14,  1.46s/it]
0: {'loss': 0.2893, 'grad_norm': 0.30490400114093213, 'learning_rate': 1.97e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.03}
0: {'loss': 0.2882, 'grad_norm': 0.299265789503041, 'learning_rate': 2.27e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.04}
  3%|▎         | 51/1621 [04:31<37:56,  1.45s/it]
  3%|▎         | 52/1621 [04:32<37:40,  1.44s/it]
  3%|▎         | 53/1621 [04:34<40:10,  1.54s/it]
  3%|▎         | 54/1621 [04:35<39:27,  1.51s/it]
  3%|▎         | 55/1621 [04:37<38:36,  1.48s/it]
  3%|▎         | 56/1621 [04:38<38:04,  1.46s/it]
  4%|▎         | 57/1621 [04:40<37:45,  1.45s/it]
  4%|▎         | 58/1621 [04:41<37:26,  1.44s/it]
  4%|▎         | 59/1621 [04:42<37:16,  1.43s/it]
  4%|▎         | 60/1621 [04:44<37:05,  1.43s/it]
  4%|▍         | 61/1621 [04:45<37:02,  1.42s/it]
  4%|▍         | 62/1621 [04:47<36:56,  1.42s/it]
  4%|▍         | 63/1621 [04:48<36:46,  1.42s/it]
  4%|▍         | 64/1621 [04:50<36:50,  1.42s/it]
  4%|▍         | 65/1621 [04:51<36:52,  1.42s/it]
  4%|▍         | 66/1621 [04:52<36:47,  1.42s/it]
  4%|▍         | 67/1621 [04:54<36:53,  1.42s/it]
0: {'loss': 0.2852, 'grad_norm': 0.3075656415183605, 'learning_rate': 2.5700000000000004e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.04}
0: {'loss': 0.2809, 'grad_norm': 0.2915611347823235, 'learning_rate': 2.87e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.05}
  4%|▍         | 68/1621 [04:55<36:52,  1.42s/it]
  4%|▍         | 69/1621 [04:57<36:55,  1.43s/it]
  4%|▍         | 70/1621 [04:58<37:38,  1.46s/it]
  4%|▍         | 71/1621 [05:00<37:59,  1.47s/it]
  4%|▍         | 72/1621 [05:01<37:34,  1.46s/it]
  5%|▍         | 73/1621 [05:03<37:12,  1.44s/it]
  5%|▍         | 74/1621 [05:04<37:25,  1.45s/it]
  5%|▍         | 75/1621 [05:05<37:39,  1.46s/it]
  5%|▍         | 76/1621 [05:07<37:20,  1.45s/it]
  5%|▍         | 77/1621 [05:08<38:22,  1.49s/it]
  5%|▍         | 78/1621 [05:10<37:41,  1.47s/it]
  5%|▍         | 79/1621 [05:11<37:28,  1.46s/it]
  5%|▍         | 80/1621 [05:13<37:56,  1.48s/it]
  5%|▍         | 81/1621 [05:14<37:23,  1.46s/it]
  5%|▌         | 82/1621 [05:16<37:36,  1.47s/it]
  5%|▌         | 83/1621 [05:17<37:10,  1.45s/it]
0: {'loss': 0.2866, 'grad_norm': 0.3188606948350198, 'learning_rate': 3.17e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.06}
  5%|▌         | 84/1621 [05:19<36:55,  1.44s/it]
  5%|▌         | 85/1621 [05:20<36:49,  1.44s/it]
  5%|▌         | 86/1621 [05:21<36:38,  1.43s/it]
  5%|▌         | 87/1621 [05:23<36:44,  1.44s/it]
  5%|▌         | 88/1621 [05:24<36:29,  1.43s/it]
  5%|▌         | 89/1621 [05:26<36:19,  1.42s/it]
  6%|▌         | 90/1621 [05:27<36:49,  1.44s/it]
  6%|▌         | 91/1621 [05:29<36:33,  1.43s/it]
  6%|▌         | 92/1621 [05:30<36:45,  1.44s/it]
  6%|▌         | 93/1621 [05:32<37:16,  1.46s/it]
  6%|▌         | 94/1621 [05:33<36:47,  1.45s/it]
  6%|▌         | 95/1621 [05:34<36:28,  1.43s/it]
  6%|▌         | 96/1621 [05:36<36:20,  1.43s/it]
  6%|▌         | 97/1621 [05:38<38:52,  1.53s/it]
  6%|▌         | 98/1621 [05:39<38:14,  1.51s/it]
  6%|▌         | 99/1621 [05:40<37:28,  1.48s/it]
  6%|▌         | 100/1621 [05:42<36:55,  1.46s/it]
0: {'loss': 0.2892, 'grad_norm': 0.3196600090952391, 'learning_rate': 3.4700000000000007e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.06}
0: {'loss': 0.2886, 'grad_norm': 0.3064739256293986, 'learning_rate': 3.7700000000000003e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.07}
  6%|▌         | 101/1621 [05:43<37:06,  1.46s/it]
  6%|▋         | 102/1621 [05:45<36:37,  1.45s/it]
  6%|▋         | 103/1621 [05:46<36:27,  1.44s/it]
  6%|▋         | 104/1621 [05:48<36:37,  1.45s/it]
  6%|▋         | 105/1621 [05:49<36:16,  1.44s/it]
  7%|▋         | 106/1621 [05:50<36:26,  1.44s/it]
  7%|▋         | 107/1621 [05:52<36:24,  1.44s/it]
  7%|▋         | 108/1621 [05:53<36:24,  1.44s/it]
  7%|▋         | 109/1621 [05:55<36:33,  1.45s/it]
  7%|▋         | 110/1621 [05:56<36:19,  1.44s/it]
  7%|▋         | 111/1621 [05:58<36:14,  1.44s/it]
  7%|▋         | 112/1621 [05:59<35:59,  1.43s/it]
  7%|▋         | 113/1621 [06:01<36:09,  1.44s/it]
  7%|▋         | 114/1621 [06:02<36:03,  1.44s/it]
  7%|▋         | 115/1621 [06:03<35:50,  1.43s/it]
  7%|▋         | 116/1621 [06:05<35:58,  1.43s/it]
0: {'loss': 0.2892, 'grad_norm': 0.4347169189893856, 'learning_rate': 4.07e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.07}
0: {'loss': 0.2864, 'grad_norm': 0.2939409486111771, 'learning_rate': 4.3700000000000005e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.08}
  7%|▋         | 117/1621 [06:06<35:45,  1.43s/it]
  7%|▋         | 118/1621 [06:08<37:32,  1.50s/it]
  7%|▋         | 119/1621 [06:09<36:53,  1.47s/it]
  7%|▋         | 120/1621 [06:11<39:52,  1.59s/it]
  7%|▋         | 121/1621 [06:13<38:33,  1.54s/it]
  8%|▊         | 122/1621 [06:14<38:12,  1.53s/it]
  8%|▊         | 123/1621 [06:16<37:29,  1.50s/it]
  8%|▊         | 124/1621 [06:17<38:19,  1.54s/it]
  8%|▊         | 125/1621 [06:19<37:40,  1.51s/it]
  8%|▊         | 126/1621 [06:20<36:54,  1.48s/it]
  8%|▊         | 127/1621 [06:22<37:21,  1.50s/it]
  8%|▊         | 128/1621 [06:23<36:45,  1.48s/it]
  8%|▊         | 129/1621 [06:24<36:20,  1.46s/it]
  8%|▊         | 130/1621 [06:26<36:32,  1.47s/it]
  8%|▊         | 131/1621 [06:28<38:40,  1.56s/it]
0: {'loss': 0.281, 'grad_norm': 0.3037682241274547, 'learning_rate': 4.67e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.09}
  8%|▊         | 132/1621 [06:29<38:01,  1.53s/it]
  8%|▊         | 133/1621 [06:31<37:03,  1.49s/it]
  8%|▊         | 134/1621 [06:32<36:20,  1.47s/it]
  8%|▊         | 135/1621 [06:33<35:58,  1.45s/it]
  8%|▊         | 136/1621 [06:35<35:44,  1.44s/it]
  8%|▊         | 137/1621 [06:36<35:36,  1.44s/it]
  9%|▊         | 138/1621 [06:38<35:34,  1.44s/it]
  9%|▊         | 139/1621 [06:39<35:27,  1.44s/it]
  9%|▊         | 140/1621 [06:41<36:24,  1.47s/it]
  9%|▊         | 141/1621 [06:42<36:13,  1.47s/it]
  9%|▉         | 142/1621 [06:44<35:42,  1.45s/it]
  9%|▉         | 143/1621 [06:45<35:25,  1.44s/it]
  9%|▉         | 144/1621 [06:46<35:41,  1.45s/it]
  9%|▉         | 145/1621 [06:48<35:31,  1.44s/it]
  9%|▉         | 146/1621 [06:49<35:13,  1.43s/it]
  9%|▉         | 147/1621 [06:51<35:00,  1.43s/it]
  9%|▉         | 148/1621 [06:52<34:49,  1.42s/it]
0: {'loss': 0.2872, 'grad_norm': 0.31896588075238047, 'learning_rate': 4.970000000000001e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.09}
0: {'loss': 0.2745, 'grad_norm': 0.3079236544412073, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.1}
  9%|▉         | 149/1621 [06:54<35:19,  1.44s/it]
  9%|▉         | 150/1621 [06:55<35:03,  1.43s/it]
  9%|▉         | 151/1621 [06:57<36:42,  1.50s/it]
  9%|▉         | 152/1621 [06:58<36:43,  1.50s/it]
  9%|▉         | 153/1621 [07:00<38:25,  1.57s/it]
 10%|▉         | 154/1621 [07:01<37:12,  1.52s/it]
 10%|▉         | 155/1621 [07:03<37:08,  1.52s/it]
 10%|▉         | 156/1621 [07:04<36:52,  1.51s/it]
 10%|▉         | 157/1621 [07:06<36:39,  1.50s/it]
 10%|▉         | 158/1621 [07:07<36:16,  1.49s/it]
 10%|▉         | 159/1621 [07:09<35:54,  1.47s/it]
 10%|▉         | 160/1621 [07:10<37:48,  1.55s/it]
 10%|▉         | 161/1621 [07:12<37:09,  1.53s/it]
 10%|▉         | 162/1621 [07:13<36:24,  1.50s/it]
 10%|█         | 163/1621 [07:15<35:42,  1.47s/it]
 10%|█         | 164/1621 [07:16<35:13,  1.45s/it]
0: {'loss': 0.2822, 'grad_norm': 0.30913370447364324, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.1}
 10%|█         | 165/1621 [07:18<35:36,  1.47s/it]
 10%|█         | 166/1621 [07:19<35:31,  1.46s/it]
 10%|█         | 167/1621 [07:21<35:11,  1.45s/it]
 10%|█         | 168/1621 [07:22<35:03,  1.45s/it]
 10%|█         | 169/1621 [07:23<34:49,  1.44s/it]
 10%|█         | 170/1621 [07:25<34:34,  1.43s/it]
 11%|█         | 171/1621 [07:26<34:39,  1.43s/it]
 11%|█         | 172/1621 [07:28<34:59,  1.45s/it]
 11%|█         | 173/1621 [07:29<34:44,  1.44s/it]
 11%|█         | 174/1621 [07:31<34:48,  1.44s/it]
 11%|█         | 175/1621 [07:32<34:35,  1.44s/it]
 11%|█         | 176/1621 [07:34<36:14,  1.50s/it]
 11%|█         | 177/1621 [07:35<36:48,  1.53s/it]
 11%|█         | 178/1621 [07:37<37:07,  1.54s/it]
 11%|█         | 179/1621 [07:38<36:20,  1.51s/it]
 11%|█         | 180/1621 [07:40<35:35,  1.48s/it]
0: {'loss': 0.2757, 'grad_norm': 0.30350079506000416, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.11}
0: {'loss': 0.2709, 'grad_norm': 0.31655651271681584, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.12}
 11%|█         | 181/1621 [07:41<37:12,  1.55s/it]
 11%|█         | 182/1621 [07:43<36:18,  1.51s/it]
 11%|█▏        | 183/1621 [07:44<35:33,  1.48s/it]
 11%|█▏        | 184/1621 [07:46<35:08,  1.47s/it]
 11%|█▏        | 185/1621 [07:47<35:46,  1.49s/it]
 11%|█▏        | 186/1621 [07:49<36:02,  1.51s/it]
 12%|█▏        | 187/1621 [07:50<36:06,  1.51s/it]
 12%|█▏        | 188/1621 [07:52<35:49,  1.50s/it]
 12%|█▏        | 189/1621 [07:53<35:16,  1.48s/it]
 12%|█▏        | 190/1621 [07:55<35:37,  1.49s/it]
 12%|█▏        | 191/1621 [07:56<36:04,  1.51s/it]
 12%|█▏        | 192/1621 [07:58<36:16,  1.52s/it]
 12%|█▏        | 193/1621 [07:59<37:00,  1.56s/it]
 12%|█▏        | 194/1621 [08:01<36:01,  1.51s/it]
 12%|█▏        | 195/1621 [08:03<37:00,  1.56s/it]
 12%|█▏        | 196/1621 [08:04<35:52,  1.51s/it]
0: {'loss': 0.2814, 'grad_norm': 0.3044835281174804, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.12}
0: {'loss': 0.2749, 'grad_norm': 0.3002036469553508, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.13}
 12%|█▏        | 197/1621 [08:05<35:16,  1.49s/it]
 12%|█▏        | 198/1621 [08:07<34:42,  1.46s/it]
 12%|█▏        | 199/1621 [08:08<34:24,  1.45s/it]
 12%|█▏        | 200/1621 [08:10<34:04,  1.44s/it]
 12%|█▏        | 201/1621 [08:11<33:54,  1.43s/it]
 12%|█▏        | 202/1621 [08:12<33:39,  1.42s/it]
 13%|█▎        | 203/1621 [08:14<33:31,  1.42s/it]
 13%|█▎        | 204/1621 [08:15<34:04,  1.44s/it]
 13%|█▎        | 205/1621 [08:17<34:24,  1.46s/it]
 13%|█▎        | 206/1621 [08:18<34:14,  1.45s/it]
 13%|█▎        | 207/1621 [08:20<33:53,  1.44s/it]
 13%|█▎        | 208/1621 [08:21<34:09,  1.45s/it]
 13%|█▎        | 209/1621 [08:23<34:02,  1.45s/it]
 13%|█▎        | 210/1621 [08:24<34:30,  1.47s/it]
 13%|█▎        | 211/1621 [08:26<34:10,  1.45s/it]
0: {'loss': 0.2797, 'grad_norm': 0.2967747880428994, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.14}
 13%|█▎        | 212/1621 [08:27<33:48,  1.44s/it]
 13%|█▎        | 213/1621 [08:28<33:37,  1.43s/it]
 13%|█▎        | 214/1621 [08:30<33:32,  1.43s/it]
 13%|█▎        | 215/1621 [08:31<33:21,  1.42s/it]
 13%|█▎        | 216/1621 [08:33<34:03,  1.45s/it]
 13%|█▎        | 217/1621 [08:34<33:43,  1.44s/it]
 13%|█▎        | 218/1621 [08:36<33:46,  1.44s/it]
 14%|█▎        | 219/1621 [08:37<33:29,  1.43s/it]
 14%|█▎        | 220/1621 [08:39<35:14,  1.51s/it]
 14%|█▎        | 221/1621 [08:40<34:47,  1.49s/it]
 14%|█▎        | 222/1621 [08:42<34:32,  1.48s/it]
 14%|█▍        | 223/1621 [08:43<34:02,  1.46s/it]
 14%|█▍        | 224/1621 [08:44<34:29,  1.48s/it]
 14%|█▍        | 225/1621 [08:46<34:08,  1.47s/it]
 14%|█▍        | 226/1621 [08:47<33:39,  1.45s/it]
 14%|█▍        | 227/1621 [08:49<34:02,  1.47s/it]
0: {'loss': 0.267, 'grad_norm': 0.31143673730927085, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.14}
0: {'loss': 0.2743, 'grad_norm': 0.3155526747255406, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.15}
 14%|█▍        | 228/1621 [08:50<34:07,  1.47s/it]
 14%|█▍        | 229/1621 [08:52<33:45,  1.46s/it]
 14%|█▍        | 230/1621 [08:53<33:39,  1.45s/it]
 14%|█▍        | 231/1621 [08:55<33:31,  1.45s/it]
 14%|█▍        | 232/1621 [08:56<33:29,  1.45s/it]
 14%|█▍        | 233/1621 [08:58<33:29,  1.45s/it]
 14%|█▍        | 234/1621 [08:59<33:17,  1.44s/it]
 14%|█▍        | 235/1621 [09:00<33:31,  1.45s/it]
 15%|█▍        | 236/1621 [09:02<33:28,  1.45s/it]
 15%|█▍        | 237/1621 [09:03<33:31,  1.45s/it]
 15%|█▍        | 238/1621 [09:05<33:07,  1.44s/it]
 15%|█▍        | 239/1621 [09:06<33:20,  1.45s/it]
 15%|█▍        | 240/1621 [09:08<33:17,  1.45s/it]
 15%|█▍        | 241/1621 [09:09<33:16,  1.45s/it]
 15%|█▍        | 242/1621 [09:11<33:18,  1.45s/it]
0: {'loss': 0.2788, 'grad_norm': 0.3136432802461907, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.15}
 15%|█▍        | 243/1621 [09:12<33:03,  1.44s/it]
 15%|█▌        | 244/1621 [09:14<35:31,  1.55s/it]
 15%|█▌        | 245/1621 [09:15<34:31,  1.51s/it]
 15%|█▌        | 246/1621 [09:17<34:07,  1.49s/it]
 15%|█▌        | 247/1621 [09:18<33:36,  1.47s/it]
 15%|█▌        | 248/1621 [09:19<33:11,  1.45s/it]
 15%|█▌        | 249/1621 [09:21<33:12,  1.45s/it]
 15%|█▌        | 250/1621 [09:22<33:00,  1.44s/it]
 15%|█▌        | 251/1621 [09:24<33:01,  1.45s/it]
 16%|█▌        | 252/1621 [09:25<32:56,  1.44s/it]
 16%|█▌        | 253/1621 [09:27<33:06,  1.45s/it]
 16%|█▌        | 254/1621 [09:28<33:39,  1.48s/it]
 16%|█▌        | 255/1621 [09:30<33:17,  1.46s/it]
 16%|█▌        | 256/1621 [09:31<33:37,  1.48s/it]
 16%|█▌        | 257/1621 [09:33<33:18,  1.46s/it]
 16%|█▌        | 258/1621 [09:34<32:51,  1.45s/it]
 16%|█▌        | 259/1621 [09:35<32:43,  1.44s/it]
0: {'loss': 0.2809, 'grad_norm': 0.3423992975218093, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.16}
0: {'loss': 0.2701, 'grad_norm': 0.3288994173746047, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.17}
 16%|█▌        | 260/1621 [09:37<32:41,  1.44s/it]
 16%|█▌        | 261/1621 [09:38<32:38,  1.44s/it]
 16%|█▌        | 262/1621 [09:40<35:34,  1.57s/it]
 16%|█▌        | 263/1621 [09:42<34:53,  1.54s/it]
 16%|█▋        | 264/1621 [09:43<34:05,  1.51s/it]
 16%|█▋        | 265/1621 [09:45<33:58,  1.50s/it]
 16%|█▋        | 266/1621 [09:46<33:35,  1.49s/it]
 16%|█▋        | 267/1621 [09:47<32:58,  1.46s/it]
 17%|█▋        | 268/1621 [09:49<32:32,  1.44s/it]
 17%|█▋        | 269/1621 [09:50<33:35,  1.49s/it]
 17%|█▋        | 270/1621 [09:52<33:04,  1.47s/it]
 17%|█▋        | 271/1621 [09:53<32:41,  1.45s/it]
 17%|█▋        | 272/1621 [09:55<34:13,  1.52s/it]
 17%|█▋        | 273/1621 [09:56<33:25,  1.49s/it]
0: {'loss': 0.2807, 'grad_norm': 0.29274925430224524, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.17}
 17%|█▋        | 274/1621 [09:58<32:57,  1.47s/it]
 17%|█▋        | 275/1621 [09:59<32:34,  1.45s/it]
 17%|█▋        | 276/1621 [10:01<32:38,  1.46s/it]
 17%|█▋        | 277/1621 [10:02<32:33,  1.45s/it]
 17%|█▋        | 278/1621 [10:04<32:28,  1.45s/it]
 17%|█▋        | 279/1621 [10:05<32:07,  1.44s/it]
 17%|█▋        | 280/1621 [10:06<31:50,  1.42s/it]
 17%|█▋        | 281/1621 [10:08<31:48,  1.42s/it]
 17%|█▋        | 282/1621 [10:09<31:38,  1.42s/it]
 17%|█▋        | 283/1621 [10:11<32:33,  1.46s/it]
 18%|█▊        | 284/1621 [10:12<32:10,  1.44s/it]
 18%|█▊        | 285/1621 [10:14<32:00,  1.44s/it]
 18%|█▊        | 286/1621 [10:15<33:05,  1.49s/it]
 18%|█▊        | 287/1621 [10:17<32:53,  1.48s/it]
 18%|█▊        | 288/1621 [10:18<32:42,  1.47s/it]
 18%|█▊        | 289/1621 [10:19<32:14,  1.45s/it]
 18%|█▊        | 290/1621 [10:21<32:04,  1.45s/it]
0: {'loss': 0.2773, 'grad_norm': 0.31838982571156305, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.18}
0: {'loss': 0.278, 'grad_norm': 0.32447615695347176, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.19}
 18%|█▊        | 291/1621 [10:23<33:39,  1.52s/it]
 18%|█▊        | 292/1621 [10:24<32:57,  1.49s/it]
 18%|█▊        | 293/1621 [10:26<33:42,  1.52s/it]
 18%|█▊        | 294/1621 [10:27<33:04,  1.50s/it]
 18%|█▊        | 295/1621 [10:29<34:11,  1.55s/it]
 18%|█▊        | 296/1621 [10:30<33:22,  1.51s/it]
 18%|█▊        | 297/1621 [10:32<32:50,  1.49s/it]
 18%|█▊        | 298/1621 [10:33<32:43,  1.48s/it]
 18%|█▊        | 299/1621 [10:34<32:19,  1.47s/it]
 19%|█▊        | 300/1621 [10:36<32:09,  1.46s/it]
 19%|█▊        | 301/1621 [10:37<31:51,  1.45s/it]
 19%|█▊        | 302/1621 [10:39<31:35,  1.44s/it]
 19%|█▊        | 303/1621 [10:40<31:23,  1.43s/it]
 19%|█▉        | 304/1621 [10:42<31:30,  1.44s/it]
 19%|█▉        | 305/1621 [10:43<32:15,  1.47s/it]
0: {'loss': 0.2753, 'grad_norm': 0.342886066749926, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.35, 'epoch': 0.19}
0: {'loss': 0.2776, 'grad_norm': 0.30166316287017786, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.2}
 19%|█▉        | 306/1621 [10:45<31:50,  1.45s/it]
 19%|█▉        | 307/1621 [10:46<31:42,  1.45s/it]
 19%|█▉        | 308/1621 [10:47<31:42,  1.45s/it]
 19%|█▉        | 309/1621 [10:49<31:25,  1.44s/it]
 19%|█▉        | 310/1621 [10:50<31:20,  1.43s/it]
 19%|█▉        | 311/1621 [10:52<31:25,  1.44s/it]
 19%|█▉        | 312/1621 [10:53<31:24,  1.44s/it]
 19%|█▉        | 313/1621 [10:55<32:04,  1.47s/it]
 19%|█▉        | 314/1621 [10:56<32:53,  1.51s/it]
 19%|█▉        | 315/1621 [10:58<32:18,  1.48s/it]
 19%|█▉        | 316/1621 [10:59<31:48,  1.46s/it]
 20%|█▉        | 317/1621 [11:01<31:27,  1.45s/it]
 20%|█▉        | 318/1621 [11:02<32:08,  1.48s/it]
 20%|█▉        | 319/1621 [11:04<31:47,  1.47s/it]
 20%|█▉        | 320/1621 [11:05<31:25,  1.45s/it]
0: [2025-09-02 18:59:03,462] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-325
0: [2025-09-02 18:59:08,211] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: {'loss': 0.2799, 'grad_norm': 0.3399264182401738, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.2}
 20%|█▉        | 321/1621 [11:06<31:13,  1.44s/it]
 20%|█▉        | 322/1621 [11:08<35:05,  1.62s/it]
 20%|█▉        | 323/1621 [11:10<33:43,  1.56s/it]
 20%|█▉        | 324/1621 [11:11<32:41,  1.51s/it]
 20%|██        | 325/1621 [11:13<32:05,  1.49s/it]
 20%|██        | 326/1621 [11:23<1:32:18,  4.28s/it]
 20%|██        | 327/1621 [11:25<1:15:29,  3.50s/it]
 20%|██        | 328/1621 [11:27<1:02:02,  2.88s/it]
 20%|██        | 329/1621 [11:28<53:10,  2.47s/it]
 20%|██        | 330/1621 [11:30<46:38,  2.17s/it]
 20%|██        | 331/1621 [11:31<41:45,  1.94s/it]
 20%|██        | 332/1621 [11:33<39:29,  1.84s/it]
 21%|██        | 333/1621 [11:34<36:48,  1.71s/it]
 21%|██        | 334/1621 [11:36<35:24,  1.65s/it]
 21%|██        | 335/1621 [11:37<34:03,  1.59s/it]
 21%|██        | 336/1621 [11:39<34:18,  1.60s/it]
0: {'loss': 0.268, 'grad_norm': 0.3251493921539059, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.21}
0: {'loss': 0.2727, 'grad_norm': 0.3288623888122016, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.22}
 21%|██        | 337/1621 [11:40<33:44,  1.58s/it]
 21%|██        | 338/1621 [11:42<32:55,  1.54s/it]
 21%|██        | 339/1621 [11:43<32:29,  1.52s/it]
 21%|██        | 340/1621 [11:45<32:21,  1.52s/it]
 21%|β–ˆβ–ˆ        | 341/1621 [11:46<33:41,  1.58s/it]
 21%|β–ˆβ–ˆ        | 342/1621 [11:48<33:55,  1.59s/it]
 21%|β–ˆβ–ˆ        | 343/1621 [11:49<32:46,  1.54s/it]
 21%|β–ˆβ–ˆ        | 344/1621 [11:51<32:27,  1.53s/it]
 21%|β–ˆβ–ˆβ–       | 345/1621 [11:52<31:48,  1.50s/it]
 21%|β–ˆβ–ˆβ–       | 346/1621 [11:54<31:13,  1.47s/it]
 21%|β–ˆβ–ˆβ–       | 347/1621 [11:55<31:07,  1.47s/it]
 21%|β–ˆβ–ˆβ–       | 348/1621 [11:57<30:46,  1.45s/it]
 22%|β–ˆβ–ˆβ–       | 349/1621 [11:58<30:38,  1.45s/it]
 22%|β–ˆβ–ˆβ–       | 350/1621 [11:59<30:35,  1.44s/it]
                                                  

 22%|β–ˆβ–ˆβ–       | 350/1621 [11:59<30:35,  1.44s/it]
0: {'loss': 0.2751, 'grad_norm': 0.3024279886733832, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.22}
0:  22%|██▏       | 351/1621 [12:01<30:23,  1.44s/it]
 22%|██▏       | 352/1621 [12:02<30:25,  1.44s/it]
 22%|██▏       | 353/1621 [12:04<30:17,  1.43s/it]
 22%|██▏       | 354/1621 [12:05<30:10,  1.43s/it]
 22%|██▏       | 355/1621 [12:07<30:06,  1.43s/it]
 22%|██▏       | 356/1621 [12:08<30:00,  1.42s/it]
 22%|██▏       | 357/1621 [12:09<30:09,  1.43s/it]
 22%|██▏       | 358/1621 [12:11<30:13,  1.44s/it]
 22%|██▏       | 359/1621 [12:12<30:31,  1.45s/it]
 22%|██▏       | 360/1621 [12:14<30:22,  1.45s/it]
 22%|██▏       | 361/1621 [12:15<30:19,  1.44s/it]
 22%|██▏       | 362/1621 [12:17<30:30,  1.45s/it]
 22%|██▏       | 363/1621 [12:18<30:11,  1.44s/it]
 22%|██▏       | 364/1621 [12:19<30:05,  1.44s/it]
 23%|██▎       | 365/1621 [12:21<30:27,  1.45s/it]
 23%|██▎       | 366/1621 [12:22<30:21,  1.45s/it]
 23%|██▎       | 367/1621 [12:24<30:09,  1.44s/it]
0: {'loss': 0.2709, 'grad_norm': 0.32115118197616366, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.23}
0: {'loss': 0.2705, 'grad_norm': 0.3241814781093123, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.23}
 23%|██▎       | 368/1621 [12:25<30:15,  1.45s/it]
 23%|██▎       | 369/1621 [12:27<30:28,  1.46s/it]
 23%|██▎       | 370/1621 [12:28<30:24,  1.46s/it]
 23%|██▎       | 371/1621 [12:30<30:14,  1.45s/it]
 23%|██▎       | 372/1621 [12:31<29:52,  1.44s/it]
 23%|██▎       | 373/1621 [12:33<29:41,  1.43s/it]
 23%|██▎       | 374/1621 [12:34<29:43,  1.43s/it]
 23%|██▎       | 375/1621 [12:36<31:24,  1.51s/it]
 23%|██▎       | 376/1621 [12:37<30:43,  1.48s/it]
 23%|██▎       | 377/1621 [12:39<30:34,  1.47s/it]
 23%|██▎       | 378/1621 [12:40<30:17,  1.46s/it]
 23%|██▎       | 379/1621 [12:41<29:55,  1.45s/it]
 23%|██▎       | 380/1621 [12:43<29:45,  1.44s/it]
 24%|██▎       | 381/1621 [12:44<29:42,  1.44s/it]
0: {'loss': 0.2701, 'grad_norm': 0.3202907610900123, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.24}
 24%|██▎       | 382/1621 [12:46<32:44,  1.59s/it]
 24%|██▎       | 383/1621 [12:48<31:35,  1.53s/it]
 24%|██▎       | 384/1621 [12:49<30:44,  1.49s/it]
 24%|██▍       | 385/1621 [12:50<30:18,  1.47s/it]
 24%|██▍       | 386/1621 [12:52<30:02,  1.46s/it]
 24%|██▍       | 387/1621 [12:53<30:11,  1.47s/it]
 24%|██▍       | 388/1621 [12:55<30:15,  1.47s/it]
 24%|██▍       | 389/1621 [12:56<31:05,  1.51s/it]
 24%|██▍       | 390/1621 [12:58<30:29,  1.49s/it]
 24%|██▍       | 391/1621 [12:59<30:03,  1.47s/it]
 24%|██▍       | 392/1621 [13:01<29:45,  1.45s/it]
 24%|██▍       | 393/1621 [13:02<29:36,  1.45s/it]
 24%|██▍       | 394/1621 [13:04<31:00,  1.52s/it]
 24%|██▍       | 395/1621 [13:05<30:50,  1.51s/it]
 24%|██▍       | 396/1621 [13:07<30:19,  1.49s/it]
 24%|██▍       | 397/1621 [13:08<29:51,  1.46s/it]
0: {'loss': 0.268, 'grad_norm': 0.34656413820425974, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.25}
0: {'loss': 0.2643, 'grad_norm': 0.31693873851656673, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.25}
 25%|██▍       | 398/1621 [13:10<32:15,  1.58s/it]
 25%|██▍       | 399/1621 [13:11<31:10,  1.53s/it]
 25%|██▍       | 400/1621 [13:13<30:33,  1.50s/it]
 25%|██▍       | 401/1621 [13:14<30:38,  1.51s/it]
 25%|██▍       | 402/1621 [13:16<30:12,  1.49s/it]
 25%|██▍       | 403/1621 [13:17<29:42,  1.46s/it]
 25%|██▍       | 404/1621 [13:19<31:40,  1.56s/it]
 25%|██▍       | 405/1621 [13:20<30:45,  1.52s/it]
 25%|██▌       | 406/1621 [13:22<31:22,  1.55s/it]
 25%|██▌       | 407/1621 [13:23<30:36,  1.51s/it]
 25%|██▌       | 408/1621 [13:25<29:51,  1.48s/it]
 25%|██▌       | 409/1621 [13:26<29:27,  1.46s/it]
 25%|██▌       | 410/1621 [13:28<29:15,  1.45s/it]
 25%|██▌       | 411/1621 [13:29<29:03,  1.44s/it]
0: {'loss': 0.2672, 'grad_norm': 0.2971789023650299, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.26}
 25%|██▌       | 412/1621 [13:31<30:57,  1.54s/it]
 25%|██▌       | 413/1621 [13:32<30:23,  1.51s/it]
 26%|██▌       | 414/1621 [13:34<29:45,  1.48s/it]
 26%|██▌       | 415/1621 [13:35<29:49,  1.48s/it]
 26%|██▌       | 416/1621 [13:37<29:25,  1.46s/it]
 26%|██▌       | 417/1621 [13:38<30:17,  1.51s/it]
 26%|██▌       | 418/1621 [13:40<30:55,  1.54s/it]
 26%|██▌       | 419/1621 [13:41<30:36,  1.53s/it]
 26%|██▌       | 420/1621 [13:43<30:11,  1.51s/it]
 26%|██▌       | 421/1621 [13:44<30:35,  1.53s/it]
 26%|██▌       | 422/1621 [13:46<30:04,  1.50s/it]
 26%|██▌       | 423/1621 [13:47<29:41,  1.49s/it]
 26%|██▌       | 424/1621 [13:49<29:19,  1.47s/it]
 26%|██▌       | 425/1621 [13:50<29:01,  1.46s/it]
 26%|██▋       | 426/1621 [13:52<29:34,  1.48s/it]
 26%|██▋       | 427/1621 [13:53<29:49,  1.50s/it]
0: {'loss': 0.2613, 'grad_norm': 0.31251578412801284, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.27}
0: {'loss': 0.2693, 'grad_norm': 0.30507923848851765, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.27}
 26%|██▋       | 428/1621 [13:55<29:39,  1.49s/it]
 26%|██▋       | 429/1621 [13:56<29:11,  1.47s/it]
 27%|██▋       | 430/1621 [13:58<30:02,  1.51s/it]
 27%|██▋       | 431/1621 [13:59<30:46,  1.55s/it]
 27%|██▋       | 432/1621 [14:01<30:03,  1.52s/it]
 27%|██▋       | 433/1621 [14:02<29:30,  1.49s/it]
 27%|██▋       | 434/1621 [14:04<31:03,  1.57s/it]
 27%|██▋       | 435/1621 [14:05<30:09,  1.53s/it]
 27%|██▋       | 436/1621 [14:07<29:34,  1.50s/it]
 27%|██▋       | 437/1621 [14:08<29:30,  1.50s/it]
 27%|██▋       | 438/1621 [14:10<30:06,  1.53s/it]
 27%|██▋       | 439/1621 [14:11<29:32,  1.50s/it]
 27%|██▋       | 440/1621 [14:13<28:58,  1.47s/it]
 27%|██▋       | 441/1621 [14:14<28:33,  1.45s/it]
0: {'loss': 0.2694, 'grad_norm': 0.307964171218113, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.28}
 27%|██▋       | 442/1621 [14:16<28:20,  1.44s/it]
 27%|██▋       | 443/1621 [14:17<28:29,  1.45s/it]
 27%|██▋       | 444/1621 [14:19<28:25,  1.45s/it]
 27%|██▋       | 445/1621 [14:20<28:24,  1.45s/it]
 28%|██▊       | 446/1621 [14:21<28:17,  1.44s/it]
 28%|██▊       | 447/1621 [14:23<28:02,  1.43s/it]
 28%|██▊       | 448/1621 [14:24<27:58,  1.43s/it]
 28%|██▊       | 449/1621 [14:26<27:47,  1.42s/it]
 28%|██▊       | 450/1621 [14:27<27:45,  1.42s/it]
 28%|██▊       | 451/1621 [14:28<27:51,  1.43s/it]
 28%|██▊       | 452/1621 [14:30<27:51,  1.43s/it]
 28%|██▊       | 453/1621 [14:31<27:49,  1.43s/it]
 28%|██▊       | 454/1621 [14:33<28:02,  1.44s/it]
 28%|██▊       | 455/1621 [14:34<27:51,  1.43s/it]
 28%|██▊       | 456/1621 [14:36<27:43,  1.43s/it]
 28%|██▊       | 457/1621 [14:37<27:44,  1.43s/it]
0: {'loss': 0.2718, 'grad_norm': 0.3093931433367332, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.28}
0: {'loss': 0.2638, 'grad_norm': 0.3165329083358544, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.29}
 28%|██▊       | 458/1621 [14:39<27:49,  1.44s/it]
 28%|██▊       | 459/1621 [14:40<27:56,  1.44s/it]
 28%|██▊       | 460/1621 [14:41<27:56,  1.44s/it]
 28%|██▊       | 461/1621 [14:43<27:49,  1.44s/it]
 29%|██▊       | 462/1621 [14:44<27:38,  1.43s/it]
 29%|██▊       | 463/1621 [14:46<27:29,  1.42s/it]
 29%|██▊       | 464/1621 [14:47<29:02,  1.51s/it]
 29%|██▊       | 465/1621 [14:49<28:58,  1.50s/it]
 29%|██▊       | 466/1621 [14:50<28:34,  1.48s/it]
 29%|██▉       | 467/1621 [14:52<29:23,  1.53s/it]
 29%|██▉       | 468/1621 [14:53<28:45,  1.50s/it]
 29%|██▉       | 469/1621 [14:55<28:16,  1.47s/it]
 29%|██▉       | 470/1621 [14:56<27:54,  1.45s/it]
 29%|██▉       | 471/1621 [14:58<29:34,  1.54s/it]
0: {'loss': 0.2692, 'grad_norm': 0.3101656418893075, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.3}
 29%|██▉       | 472/1621 [14:59<28:51,  1.51s/it]
 29%|██▉       | 473/1621 [15:01<28:21,  1.48s/it]
 29%|██▉       | 474/1621 [15:02<28:53,  1.51s/it]
 29%|██▉       | 475/1621 [15:04<28:27,  1.49s/it]
 29%|██▉       | 476/1621 [15:05<27:59,  1.47s/it]
 29%|██▉       | 477/1621 [15:07<27:47,  1.46s/it]
 29%|██▉       | 478/1621 [15:08<27:28,  1.44s/it]
 30%|██▉       | 479/1621 [15:10<27:27,  1.44s/it]
 30%|██▉       | 480/1621 [15:11<27:21,  1.44s/it]
 30%|██▉       | 481/1621 [15:13<28:29,  1.50s/it]
 30%|██▉       | 482/1621 [15:14<27:58,  1.47s/it]
 30%|██▉       | 483/1621 [15:15<27:39,  1.46s/it]
 30%|██▉       | 484/1621 [15:17<27:37,  1.46s/it]
 30%|██▉       | 485/1621 [15:18<27:22,  1.45s/it]
 30%|██▉       | 486/1621 [15:20<27:17,  1.44s/it]
 30%|███       | 487/1621 [15:21<27:06,  1.43s/it]
0: {'loss': 0.2662, 'grad_norm': 0.30695490828095756, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.3}
0: {'loss': 0.2704, 'grad_norm': 0.32199906361703134, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.31}
 30%|███       | 488/1621 [15:23<27:11,  1.44s/it]
 30%|███       | 489/1621 [15:24<28:34,  1.51s/it]
 30%|███       | 490/1621 [15:26<28:01,  1.49s/it]
 30%|███       | 491/1621 [15:27<29:22,  1.56s/it]
 30%|███       | 492/1621 [15:29<28:32,  1.52s/it]
 30%|███       | 493/1621 [15:30<27:55,  1.49s/it]
 30%|███       | 494/1621 [15:32<27:45,  1.48s/it]
 31%|███       | 495/1621 [15:33<27:28,  1.46s/it]
 31%|███       | 496/1621 [15:35<27:28,  1.47s/it]
 31%|███       | 497/1621 [15:36<27:33,  1.47s/it]
 31%|███       | 498/1621 [15:38<27:06,  1.45s/it]
 31%|███       | 499/1621 [15:39<27:05,  1.45s/it]
 31%|███       | 500/1621 [15:40<26:51,  1.44s/it]
 31%|███       | 501/1621 [15:42<26:54,  1.44s/it]
0: {'loss': 0.2698, 'grad_norm': 0.3168722994354358, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.31}
 31%|███       | 502/1621 [15:43<27:53,  1.50s/it]
 31%|███       | 503/1621 [15:45<27:32,  1.48s/it]
 31%|███       | 504/1621 [15:46<27:12,  1.46s/it]
 31%|███       | 505/1621 [15:48<26:55,  1.45s/it]
 31%|███       | 506/1621 [15:49<26:47,  1.44s/it]
 31%|███▏      | 507/1621 [15:51<26:32,  1.43s/it]
 31%|███▏      | 508/1621 [15:52<26:23,  1.42s/it]
 31%|███▏      | 509/1621 [15:54<27:44,  1.50s/it]
 31%|███▏      | 510/1621 [15:55<27:21,  1.48s/it]
 32%|███▏      | 511/1621 [15:57<27:29,  1.49s/it]
 32%|███▏      | 512/1621 [15:58<27:02,  1.46s/it]
 32%|███▏      | 513/1621 [15:59<26:44,  1.45s/it]
 32%|███▏      | 514/1621 [16:01<26:32,  1.44s/it]
 32%|███▏      | 515/1621 [16:02<26:21,  1.43s/it]
 32%|███▏      | 516/1621 [16:04<27:26,  1.49s/it]
 32%|███▏      | 517/1621 [16:05<26:58,  1.47s/it]
0: {'loss': 0.2724, 'grad_norm': 0.3419596649304518, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.32}
0: {'loss': 0.2691, 'grad_norm': 0.3214826713689335, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.33}
 32%|███▏      | 518/1621 [16:07<26:37,  1.45s/it]
 32%|███▏      | 519/1621 [16:08<26:22,  1.44s/it]
 32%|███▏      | 520/1621 [16:09<26:13,  1.43s/it]
 32%|███▏      | 521/1621 [16:11<26:21,  1.44s/it]
 32%|███▏      | 522/1621 [16:12<26:11,  1.43s/it]
 32%|███▏      | 523/1621 [16:14<26:26,  1.44s/it]
 32%|███▏      | 524/1621 [16:15<26:27,  1.45s/it]
 32%|███▏      | 525/1621 [16:17<26:41,  1.46s/it]
 32%|███▏      | 526/1621 [16:18<27:06,  1.49s/it]
 33%|███▎      | 527/1621 [16:20<28:06,  1.54s/it]
 33%|███▎      | 528/1621 [16:22<28:58,  1.59s/it]
 33%|███▎      | 529/1621 [16:23<28:03,  1.54s/it]
 33%|███▎      | 530/1621 [16:25<27:21,  1.50s/it]
 33%|███▎      | 531/1621 [16:26<27:16,  1.50s/it]
0: {'loss': 0.2723, 'grad_norm': 0.31076689228472376, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.33}
 33%|███▎      | 532/1621 [16:28<27:28,  1.51s/it]
 33%|███▎      | 533/1621 [16:29<26:53,  1.48s/it]
 33%|███▎      | 534/1621 [16:31<28:29,  1.57s/it]
 33%|███▎      | 535/1621 [16:32<27:34,  1.52s/it]
 33%|███▎      | 536/1621 [16:34<28:02,  1.55s/it]
 33%|███▎      | 537/1621 [16:35<27:27,  1.52s/it]
 33%|███▎      | 538/1621 [16:37<27:09,  1.50s/it]
 33%|███▎      | 539/1621 [16:38<27:26,  1.52s/it]
 33%|███▎      | 540/1621 [16:40<27:29,  1.53s/it]
 33%|███▎      | 541/1621 [16:41<27:11,  1.51s/it]
 33%|███▎      | 542/1621 [16:43<26:35,  1.48s/it]
 33%|███▎      | 543/1621 [16:44<26:11,  1.46s/it]
 34%|███▎      | 544/1621 [16:46<25:53,  1.44s/it]
 34%|███▎      | 545/1621 [16:47<25:43,  1.43s/it]
 34%|███▎      | 546/1621 [16:48<25:33,  1.43s/it]
0: {'loss': 0.267, 'grad_norm': 0.3096662553218259, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.34}
0: {'loss': 0.2652, 'grad_norm': 0.3305774329274305, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.35}
 34%|███▎      | 547/1621 [16:50<25:40,  1.43s/it]
 34%|███▍      | 548/1621 [16:51<25:37,  1.43s/it]
 34%|███▍      | 549/1621 [16:53<25:57,  1.45s/it]
 34%|███▍      | 550/1621 [16:54<26:54,  1.51s/it]
 34%|███▍      | 551/1621 [16:56<26:36,  1.49s/it]
 34%|███▍      | 552/1621 [16:57<26:14,  1.47s/it]
 34%|███▍      | 553/1621 [16:59<25:54,  1.46s/it]
 34%|███▍      | 554/1621 [17:00<25:42,  1.45s/it]
 34%|███▍      | 555/1621 [17:02<27:30,  1.55s/it]
 34%|███▍      | 556/1621 [17:03<27:11,  1.53s/it]
 34%|███▍      | 557/1621 [17:05<26:39,  1.50s/it]
 34%|███▍      | 558/1621 [17:06<26:10,  1.48s/it]
 34%|███▍      | 559/1621 [17:08<25:48,  1.46s/it]
 35%|███▍      | 560/1621 [17:09<26:29,  1.50s/it]
0: {'loss': 0.27, 'grad_norm': 0.3183396295658571, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.35}
 35%|███▍      | 561/1621 [17:11<26:17,  1.49s/it]
 35%|███▍      | 562/1621 [17:12<25:50,  1.46s/it]
 35%|███▍      | 563/1621 [17:14<25:33,  1.45s/it]
 35%|███▍      | 564/1621 [17:15<25:34,  1.45s/it]
 35%|███▍      | 565/1621 [17:16<25:23,  1.44s/it]
 35%|███▍      | 566/1621 [17:18<25:08,  1.43s/it]
 35%|███▍      | 567/1621 [17:19<24:59,  1.42s/it]
 35%|███▌      | 568/1621 [17:21<24:56,  1.42s/it]
 35%|███▌      | 569/1621 [17:22<26:16,  1.50s/it]
 35%|███▌      | 570/1621 [17:24<26:08,  1.49s/it]
 35%|███▌      | 571/1621 [17:25<26:08,  1.49s/it]
 35%|███▌      | 572/1621 [17:27<25:39,  1.47s/it]
 35%|███▌      | 573/1621 [17:28<26:19,  1.51s/it]
 35%|███▌      | 574/1621 [17:30<26:44,  1.53s/it]
 35%|███▌      | 575/1621 [17:31<26:20,  1.51s/it]
0: {'loss': 0.263, 'grad_norm': 0.3045611958885712, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.36}
 36%|███▌      | 576/1621 [17:33<26:22,  1.51s/it]
 36%|███▌      | 577/1621 [17:34<27:06,  1.56s/it]
 36%|███▌      | 578/1621 [17:36<27:20,  1.57s/it]
 36%|███▌      | 579/1621 [17:38<26:35,  1.53s/it]
 36%|███▌      | 580/1621 [17:39<26:03,  1.50s/it]
 36%|███▌      | 581/1621 [17:40<25:33,  1.47s/it]
 36%|███▌      | 582/1621 [17:42<25:19,  1.46s/it]
 36%|███▌      | 583/1621 [17:43<25:56,  1.50s/it]
 36%|███▌      | 584/1621 [17:45<25:47,  1.49s/it]
 36%|███▌      | 585/1621 [17:46<25:23,  1.47s/it]
 36%|███▌      | 586/1621 [17:48<25:18,  1.47s/it]
 36%|███▌      | 587/1621 [17:50<26:48,  1.56s/it]
 36%|███▋      | 588/1621 [17:51<25:59,  1.51s/it]
 36%|███▋      | 589/1621 [17:52<25:32,  1.48s/it]
 36%|███▋      | 590/1621 [17:54<25:09,  1.46s/it]
0: {'loss': 0.2654, 'grad_norm': 0.30739048454915474, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.36}
0: {'loss': 0.2681, 'grad_norm': 0.3042454032840815, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.37}
 36%|███▋      | 591/1621 [17:55<24:57,  1.45s/it]
 37%|███▋      | 592/1621 [17:57<24:49,  1.45s/it]
 37%|███▋      | 593/1621 [17:58<24:36,  1.44s/it]
 37%|███▋      | 594/1621 [17:59<24:39,  1.44s/it]
 37%|███▋      | 595/1621 [18:01<24:28,  1.43s/it]
 37%|███▋      | 596/1621 [18:02<24:23,  1.43s/it]
 37%|███▋      | 597/1621 [18:04<24:33,  1.44s/it]
 37%|███▋      | 598/1621 [18:05<24:30,  1.44s/it]
 37%|███▋      | 599/1621 [18:07<24:27,  1.44s/it]
 37%|███▋      | 600/1621 [18:08<24:21,  1.43s/it]
 37%|███▋      | 601/1621 [18:10<24:19,  1.43s/it]
 37%|███▋      | 602/1621 [18:11<24:54,  1.47s/it]
 37%|███▋      | 603/1621 [18:12<24:45,  1.46s/it]
 37%|███▋      | 604/1621 [18:14<24:34,  1.45s/it]
0: {'loss': 0.2616, 'grad_norm': 0.30618895900501564, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.38}
 37%|███▋      | 605/1621 [18:15<24:26,  1.44s/it]
 37%|███▋      | 606/1621 [18:17<24:15,  1.43s/it]
 37%|███▋      | 607/1621 [18:18<24:25,  1.45s/it]
 38%|███▊      | 608/1621 [18:20<24:17,  1.44s/it]
 38%|███▊      | 609/1621 [18:21<25:29,  1.51s/it]
 38%|███▊      | 610/1621 [18:23<25:15,  1.50s/it]
 38%|███▊      | 611/1621 [18:24<25:04,  1.49s/it]
 38%|███▊      | 612/1621 [18:26<25:54,  1.54s/it]
 38%|███▊      | 613/1621 [18:27<25:13,  1.50s/it]
 38%|███▊      | 614/1621 [18:29<24:53,  1.48s/it]
 38%|███▊      | 615/1621 [18:30<24:36,  1.47s/it]
 38%|███▊      | 616/1621 [18:32<24:54,  1.49s/it]
 38%|███▊      | 617/1621 [18:33<24:41,  1.48s/it]
 38%|███▊      | 618/1621 [18:35<24:55,  1.49s/it]
 38%|███▊      | 619/1621 [18:36<24:38,  1.48s/it]
 38%|███▊      | 620/1621 [18:38<24:27,  1.47s/it]
0: {'loss': 0.2613, 'grad_norm': 0.2919650844350329, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.38}
0: {'loss': 0.2675, 'grad_norm': 0.3040205468853955, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.39}
 38%|███▊      | 621/1621 [18:39<24:48,  1.49s/it]
 38%|███▊      | 622/1621 [18:41<24:25,  1.47s/it]
 38%|███▊      | 623/1621 [18:42<24:41,  1.48s/it]
 38%|███▊      | 624/1621 [18:44<24:43,  1.49s/it]
 39%|███▊      | 625/1621 [18:45<24:14,  1.46s/it]
 39%|███▊      | 626/1621 [18:46<24:02,  1.45s/it]
 39%|███▊      | 627/1621 [18:48<23:50,  1.44s/it]
 39%|███▊      | 628/1621 [18:49<24:02,  1.45s/it]
 39%|███▉      | 629/1621 [18:51<25:00,  1.51s/it]
 39%|███▉      | 630/1621 [18:52<24:30,  1.48s/it]
 39%|███▉      | 631/1621 [18:54<24:21,  1.48s/it]
 39%|███▉      | 632/1621 [18:55<24:03,  1.46s/it]
 39%|███▉      | 633/1621 [18:57<24:11,  1.47s/it]
0: {'loss': 0.2699, 'grad_norm': 0.31062647915000946, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.39}
 39%|███▉      | 634/1621 [18:58<25:13,  1.53s/it]
 39%|███▉      | 635/1621 [19:00<24:52,  1.51s/it]
 39%|███▉      | 636/1621 [19:01<24:20,  1.48s/it]
 39%|███▉      | 637/1621 [19:03<24:52,  1.52s/it]
 39%|███▉      | 638/1621 [19:04<24:19,  1.48s/it]
 39%|███▉      | 639/1621 [19:06<23:58,  1.47s/it]
 39%|███▉      | 640/1621 [19:07<23:49,  1.46s/it]
 40%|███▉      | 641/1621 [19:09<24:03,  1.47s/it]
 40%|███▉      | 642/1621 [19:10<23:51,  1.46s/it]
 40%|███▉      | 643/1621 [19:12<23:36,  1.45s/it]
 40%|███▉      | 644/1621 [19:13<23:29,  1.44s/it]
 40%|███▉      | 645/1621 [19:14<23:22,  1.44s/it]
 40%|███▉      | 646/1621 [19:16<23:21,  1.44s/it]
 40%|███▉      | 647/1621 [19:17<24:05,  1.48s/it]
 40%|███▉      | 648/1621 [19:19<24:15,  1.50s/it]
 40%|████      | 649/1621 [19:20<23:51,  1.47s/it]
0: {'loss': 0.2618, 'grad_norm': 0.3166654431631163, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.4}
0: [2025-09-02 19:07:12,499] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-650
0: [2025-09-02 19:07:17,316] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: {'loss': 0.2676, 'grad_norm': 0.3157907525928946, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.41}
 40%|████      | 650/1621 [19:22<23:59,  1.48s/it]
 40%|████      | 651/1621 [19:32<1:07:48,  4.19s/it]
 40%|████      | 652/1621 [19:34<54:15,  3.36s/it]
 40%|████      | 653/1621 [19:35<45:04,  2.79s/it]
 40%|████      | 654/1621 [19:37<38:26,  2.39s/it]
 40%|████      | 655/1621 [19:38<33:59,  2.11s/it]
 40%|████      | 656/1621 [19:40<30:43,  1.91s/it]
 41%|████      | 657/1621 [19:41<29:56,  1.86s/it]
 41%|████      | 658/1621 [19:43<27:43,  1.73s/it]
 41%|████      | 659/1621 [19:44<26:15,  1.64s/it]
 41%|████      | 660/1621 [19:46<25:07,  1.57s/it]
 41%|████      | 661/1621 [19:47<24:20,  1.52s/it]
 41%|████      | 662/1621 [19:49<24:15,  1.52s/it]
0: {'loss': 0.2635, 'grad_norm': 0.30820407823108603, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.41}
 41%|████      | 663/1621 [19:50<23:57,  1.50s/it]
 41%|████      | 664/1621 [19:51<23:39,  1.48s/it]
 41%|████      | 665/1621 [19:53<24:12,  1.52s/it]
 41%|████      | 666/1621 [19:55<23:55,  1.50s/it]
 41%|████      | 667/1621 [19:56<25:02,  1.57s/it]
 41%|████      | 668/1621 [19:58<25:12,  1.59s/it]
 41%|████▏     | 669/1621 [19:59<24:22,  1.54s/it]
 41%|████▏     | 670/1621 [20:01<23:43,  1.50s/it]
 41%|████▏     | 671/1621 [20:02<23:20,  1.47s/it]
 41%|████▏     | 672/1621 [20:04<23:01,  1.46s/it]
 42%|████▏     | 673/1621 [20:05<22:55,  1.45s/it]
 42%|████▏     | 674/1621 [20:06<22:44,  1.44s/it]
 42%|████▏     | 675/1621 [20:08<22:36,  1.43s/it]
 42%|████▏     | 676/1621 [20:09<22:27,  1.43s/it]
 42%|████▏     | 677/1621 [20:11<23:34,  1.50s/it]
 42%|████▏     | 678/1621 [20:13<24:26,  1.56s/it]
0: {'loss': 0.2711, 'grad_norm': 0.3083107428928576, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.42}
0: {'loss': 0.2663, 'grad_norm': 0.3239813055521283, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.43}
 42%|████▏     | 679/1621 [20:14<23:53,  1.52s/it]
 42%|████▏     | 680/1621 [20:15<23:21,  1.49s/it]
 42%|████▏     | 681/1621 [20:17<24:32,  1.57s/it]
 42%|████▏     | 682/1621 [20:19<24:05,  1.54s/it]
 42%|████▏     | 683/1621 [20:20<23:51,  1.53s/it]
 42%|████▏     | 684/1621 [20:22<23:35,  1.51s/it]
 42%|████▏     | 685/1621 [20:23<23:05,  1.48s/it]
 42%|████▏     | 686/1621 [20:24<22:43,  1.46s/it]
 42%|████▏     | 687/1621 [20:26<22:29,  1.44s/it]
 42%|████▏     | 688/1621 [20:27<22:28,  1.44s/it]
 43%|████▎     | 689/1621 [20:29<22:17,  1.44s/it]
 43%|████▎     | 690/1621 [20:30<22:19,  1.44s/it]
0: {'loss': 0.2664, 'grad_norm': 0.3048360867705897, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.43}
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 691/1621 [20:32<22:08,  1.43s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 692/1621 [20:33<21:59,  1.42s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 693/1621 [20:34<21:54,  1.42s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 694/1621 [20:36<21:49,  1.41s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 695/1621 [20:37<21:46,  1.41s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 696/1621 [20:39<22:08,  1.44s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 697/1621 [20:40<22:00,  1.43s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 698/1621 [20:41<21:51,  1.42s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 699/1621 [20:43<21:43,  1.41s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 700/1621 [20:44<21:44,  1.42s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 701/1621 [20:46<21:50,  1.42s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 702/1621 [20:47<21:53,  1.43s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 703/1621 [20:49<21:51,  1.43s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 704/1621 [20:50<21:44,  1.42s/it]
 43%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 705/1621 [20:51<21:42,  1.42s/it]
0: {'loss': 0.264, 'grad_norm': 0.2919821903782183, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.44}
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 706/1621 [20:53<22:05,  1.45s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 707/1621 [20:55<23:19,  1.53s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 708/1621 [20:56<22:46,  1.50s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 709/1621 [20:58<22:52,  1.50s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 710/1621 [20:59<22:28,  1.48s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 711/1621 [21:01<23:08,  1.53s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 712/1621 [21:02<22:43,  1.50s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 713/1621 [21:04<22:17,  1.47s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 714/1621 [21:05<21:58,  1.45s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 715/1621 [21:06<22:07,  1.46s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 716/1621 [21:08<22:28,  1.49s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 717/1621 [21:09<22:01,  1.46s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 718/1621 [21:11<21:58,  1.46s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 719/1621 [21:12<21:50,  1.45s/it]
0: {'loss': 0.2624, 'grad_norm': 0.30328239047498634, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.44}
0: {'loss': 0.2642, 'grad_norm': 0.29982742337438895, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.45}
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 720/1621 [21:14<21:39,  1.44s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 721/1621 [21:15<21:48,  1.45s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 722/1621 [21:17<21:43,  1.45s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 723/1621 [21:18<22:12,  1.48s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 724/1621 [21:20<21:53,  1.46s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 725/1621 [21:21<21:42,  1.45s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 726/1621 [21:22<21:32,  1.44s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 727/1621 [21:24<21:26,  1.44s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 728/1621 [21:25<21:19,  1.43s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 729/1621 [21:27<21:27,  1.44s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 730/1621 [21:28<21:19,  1.44s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 731/1621 [21:30<21:17,  1.44s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 732/1621 [21:31<21:09,  1.43s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 733/1621 [21:33<21:36,  1.46s/it]
0: {'loss': 0.2673, 'grad_norm': 0.30890762426068, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.46}
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 734/1621 [21:34<22:02,  1.49s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 735/1621 [21:36<22:31,  1.53s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 736/1621 [21:37<22:03,  1.50s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 737/1621 [21:39<21:43,  1.47s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 738/1621 [21:40<21:43,  1.48s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 739/1621 [21:42<22:17,  1.52s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 740/1621 [21:43<21:51,  1.49s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 741/1621 [21:45<21:39,  1.48s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 742/1621 [21:46<21:27,  1.46s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 743/1621 [21:47<21:11,  1.45s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 744/1621 [21:49<20:59,  1.44s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 745/1621 [21:50<20:49,  1.43s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 746/1621 [21:52<20:45,  1.42s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 747/1621 [21:53<20:59,  1.44s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 748/1621 [21:55<20:57,  1.44s/it]
0: {'loss': 0.2591, 'grad_norm': 0.30747963031394887, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.46}
0: {'loss': 0.2642, 'grad_norm': 0.30645470994710144, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.47}
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 749/1621 [21:56<20:53,  1.44s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 750/1621 [21:57<20:47,  1.43s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 751/1621 [21:59<21:09,  1.46s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 752/1621 [22:00<21:08,  1.46s/it]
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 753/1621 [22:02<20:57,  1.45s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 754/1621 [22:03<20:46,  1.44s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 755/1621 [22:05<20:42,  1.43s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 756/1621 [22:06<20:59,  1.46s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 757/1621 [22:08<20:53,  1.45s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 758/1621 [22:09<21:05,  1.47s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 759/1621 [22:10<20:54,  1.45s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 760/1621 [22:12<20:42,  1.44s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 761/1621 [22:13<20:34,  1.44s/it]
0: {'loss': 0.2636, 'grad_norm': 0.3113139232378856, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.48}
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 762/1621 [22:15<21:09,  1.48s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 763/1621 [22:16<20:51,  1.46s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 764/1621 [22:18<21:57,  1.54s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 765/1621 [22:19<21:21,  1.50s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 766/1621 [22:21<21:06,  1.48s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 767/1621 [22:22<20:53,  1.47s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 768/1621 [22:24<20:52,  1.47s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 769/1621 [22:25<20:38,  1.45s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 770/1621 [22:27<20:30,  1.45s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 771/1621 [22:28<20:23,  1.44s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 772/1621 [22:29<20:21,  1.44s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 773/1621 [22:31<20:30,  1.45s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 774/1621 [22:32<20:20,  1.44s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 775/1621 [22:34<20:38,  1.46s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 776/1621 [22:35<20:32,  1.46s/it]
0: {'loss': 0.2548, 'grad_norm': 0.29543505645103424, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.48}
0: {'loss': 0.2602, 'grad_norm': 0.29282257246610377, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.49}
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 777/1621 [22:37<20:23,  1.45s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 778/1621 [22:38<21:04,  1.50s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 779/1621 [22:40<20:51,  1.49s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 780/1621 [22:41<20:45,  1.48s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 781/1621 [22:43<20:41,  1.48s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 782/1621 [22:44<20:28,  1.46s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 783/1621 [22:46<20:15,  1.45s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 784/1621 [22:47<21:18,  1.53s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 785/1621 [22:49<20:52,  1.50s/it]
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 786/1621 [22:50<20:43,  1.49s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 787/1621 [22:52<20:25,  1.47s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 788/1621 [22:53<20:27,  1.47s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 789/1621 [22:55<20:18,  1.46s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 790/1621 [22:56<20:58,  1.51s/it]
0: {'loss': 0.2561, 'grad_norm': 0.30099682495532426, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.49}
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 791/1621 [22:58<20:33,  1.49s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 792/1621 [22:59<21:12,  1.53s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 793/1621 [23:01<22:04,  1.60s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 794/1621 [23:02<21:14,  1.54s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 795/1621 [23:04<21:02,  1.53s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 796/1621 [23:05<20:32,  1.49s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 797/1621 [23:07<20:20,  1.48s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 798/1621 [23:08<20:50,  1.52s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 799/1621 [23:10<20:49,  1.52s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 800/1621 [23:11<20:41,  1.51s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 801/1621 [23:13<21:04,  1.54s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 802/1621 [23:15<20:48,  1.52s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 803/1621 [23:16<20:39,  1.52s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 804/1621 [23:17<20:16,  1.49s/it]
0: {'loss': 0.2608, 'grad_norm': 0.28990308697565914, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.5}
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 805/1621 [23:19<19:57,  1.47s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 806/1621 [23:20<19:40,  1.45s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 807/1621 [23:22<19:49,  1.46s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 808/1621 [23:23<19:48,  1.46s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 809/1621 [23:25<19:38,  1.45s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 810/1621 [23:26<19:42,  1.46s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 811/1621 [23:28<19:33,  1.45s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 812/1621 [23:29<19:29,  1.45s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 813/1621 [23:30<19:39,  1.46s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 814/1621 [23:32<19:42,  1.47s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 815/1621 [23:33<19:50,  1.48s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 816/1621 [23:35<19:37,  1.46s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 817/1621 [23:36<19:29,  1.45s/it]
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 818/1621 [23:38<19:23,  1.45s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 819/1621 [23:39<20:23,  1.53s/it]
0: {'loss': 0.2593, 'grad_norm': 0.30472493668896977, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.51}
0: {'loss': 0.2672, 'grad_norm': 0.32931684813090084, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.36, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.51}
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 820/1621 [23:41<19:58,  1.50s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 821/1621 [23:42<20:06,  1.51s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 822/1621 [23:44<19:41,  1.48s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 823/1621 [23:45<19:26,  1.46s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 824/1621 [23:47<19:14,  1.45s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 825/1621 [23:48<19:58,  1.51s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 826/1621 [23:50<19:46,  1.49s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 827/1621 [23:51<19:27,  1.47s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 828/1621 [23:53<19:17,  1.46s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 829/1621 [23:54<19:03,  1.44s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 830/1621 [23:55<18:52,  1.43s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 831/1621 [23:57<18:54,  1.44s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 832/1621 [23:58<18:47,  1.43s/it]
0: {'loss': 0.2649, 'grad_norm': 0.3394419476608706, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.52}
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 833/1621 [24:00<18:41,  1.42s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 834/1621 [24:01<18:41,  1.43s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 835/1621 [24:03<19:05,  1.46s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 836/1621 [24:04<19:00,  1.45s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 837/1621 [24:06<19:07,  1.46s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 838/1621 [24:07<20:22,  1.56s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 839/1621 [24:09<20:50,  1.60s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 840/1621 [24:11<20:13,  1.55s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 841/1621 [24:12<20:25,  1.57s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 842/1621 [24:14<19:56,  1.54s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 843/1621 [24:15<19:25,  1.50s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 844/1621 [24:16<19:02,  1.47s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 845/1621 [24:18<19:54,  1.54s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 846/1621 [24:20<19:29,  1.51s/it]
0: {'loss': 0.2671, 'grad_norm': 0.2952695242398675, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.52}
0: {'loss': 0.264, 'grad_norm': 0.29373714171314075, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.53}
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 847/1621 [24:21<19:08,  1.48s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 848/1621 [24:22<18:48,  1.46s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 849/1621 [24:24<19:33,  1.52s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 850/1621 [24:25<19:12,  1.49s/it]
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 851/1621 [24:27<18:59,  1.48s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 852/1621 [24:28<18:47,  1.47s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 853/1621 [24:30<18:50,  1.47s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 854/1621 [24:31<18:51,  1.48s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 855/1621 [24:33<19:00,  1.49s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 856/1621 [24:34<18:42,  1.47s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 857/1621 [24:36<18:31,  1.45s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 858/1621 [24:37<18:22,  1.44s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 859/1621 [24:39<18:13,  1.44s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 860/1621 [24:40<18:07,  1.43s/it]
0: {'loss': 0.2591, 'grad_norm': 0.2933249021776934, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.54}
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 861/1621 [24:41<18:09,  1.43s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 862/1621 [24:43<18:04,  1.43s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 863/1621 [24:44<18:10,  1.44s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 864/1621 [24:46<18:52,  1.50s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 865/1621 [24:47<18:33,  1.47s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 866/1621 [24:49<18:23,  1.46s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 867/1621 [24:50<18:17,  1.46s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 868/1621 [24:52<18:29,  1.47s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 869/1621 [24:53<18:18,  1.46s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 870/1621 [24:55<19:23,  1.55s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 871/1621 [24:56<18:54,  1.51s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 872/1621 [24:58<18:38,  1.49s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 873/1621 [24:59<18:37,  1.49s/it]
0: {'loss': 0.2644, 'grad_norm': 0.30874390029265875, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.54}
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 874/1621 [25:01<18:27,  1.48s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 875/1621 [25:02<18:10,  1.46s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 876/1621 [25:04<18:33,  1.49s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 877/1621 [25:05<18:22,  1.48s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 878/1621 [25:07<18:09,  1.47s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 879/1621 [25:08<18:07,  1.47s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 880/1621 [25:09<17:55,  1.45s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 881/1621 [25:11<17:48,  1.44s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 882/1621 [25:12<18:13,  1.48s/it]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 883/1621 [25:14<18:00,  1.46s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 884/1621 [25:15<17:47,  1.45s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 885/1621 [25:17<17:38,  1.44s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 886/1621 [25:18<17:44,  1.45s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 887/1621 [25:20<17:38,  1.44s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 888/1621 [25:21<18:40,  1.53s/it]
0: {'loss': 0.2605, 'grad_norm': 0.334940089088556, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.55}
0: {'loss': 0.2684, 'grad_norm': 0.3085779318464317, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.56}
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 889/1621 [25:23<18:22,  1.51s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 890/1621 [25:24<18:20,  1.51s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 891/1621 [25:26<18:14,  1.50s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 892/1621 [25:27<17:59,  1.48s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 893/1621 [25:29<17:53,  1.47s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 894/1621 [25:30<17:41,  1.46s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 895/1621 [25:32<17:28,  1.44s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 896/1621 [25:33<17:17,  1.43s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 897/1621 [25:34<17:17,  1.43s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 898/1621 [25:36<17:12,  1.43s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 899/1621 [25:37<17:08,  1.42s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 900/1621 [25:39<18:02,  1.50s/it]
0: {'loss': 0.2617, 'grad_norm': 0.3477064834521338, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.56}
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 901/1621 [25:40<17:53,  1.49s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 902/1621 [25:42<17:34,  1.47s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 903/1621 [25:43<17:22,  1.45s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 904/1621 [25:45<17:11,  1.44s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 905/1621 [25:46<17:11,  1.44s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 906/1621 [25:47<17:05,  1.43s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 907/1621 [25:49<17:01,  1.43s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 908/1621 [25:50<17:02,  1.43s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 909/1621 [25:52<16:57,  1.43s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 910/1621 [25:53<16:56,  1.43s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 911/1621 [25:55<17:24,  1.47s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 912/1621 [25:56<17:12,  1.46s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 913/1621 [25:58<17:01,  1.44s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 914/1621 [25:59<16:54,  1.43s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 915/1621 [26:00<16:51,  1.43s/it]
0: {'loss': 0.2648, 'grad_norm': 0.30501036108316015, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.57}
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 916/1621 [26:02<16:43,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 917/1621 [26:03<16:37,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 918/1621 [26:05<16:36,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 919/1621 [26:06<16:35,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 920/1621 [26:07<16:36,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 921/1621 [26:09<16:35,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 922/1621 [26:10<16:34,  1.42s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 923/1621 [26:12<16:43,  1.44s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 924/1621 [26:13<16:43,  1.44s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 925/1621 [26:15<17:29,  1.51s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 926/1621 [26:16<17:09,  1.48s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 927/1621 [26:18<17:02,  1.47s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 928/1621 [26:19<16:50,  1.46s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 929/1621 [26:21<16:44,  1.45s/it]
0: {'loss': 0.2639, 'grad_norm': 0.29127250759681794, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.57}
0: {'loss': 0.2578, 'grad_norm': 0.32113325015435573, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.58}
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 930/1621 [26:22<16:36,  1.44s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 931/1621 [26:23<16:33,  1.44s/it]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 932/1621 [26:25<16:27,  1.43s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 933/1621 [26:26<16:49,  1.47s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 934/1621 [26:28<16:55,  1.48s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 935/1621 [26:29<16:39,  1.46s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 936/1621 [26:31<16:48,  1.47s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 937/1621 [26:33<17:21,  1.52s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 938/1621 [26:34<17:11,  1.51s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 939/1621 [26:35<17:02,  1.50s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 940/1621 [26:37<16:43,  1.47s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 941/1621 [26:39<17:13,  1.52s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 942/1621 [26:40<17:13,  1.52s/it]
0: {'loss': 0.2557, 'grad_norm': 0.29622986437781335, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.59}
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 943/1621 [26:41<16:51,  1.49s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 944/1621 [26:43<16:36,  1.47s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 945/1621 [26:44<16:31,  1.47s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 946/1621 [26:46<16:23,  1.46s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 947/1621 [26:47<16:17,  1.45s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 948/1621 [26:49<16:09,  1.44s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 949/1621 [26:50<16:20,  1.46s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 950/1621 [26:52<16:26,  1.47s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 951/1621 [26:53<16:15,  1.46s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 952/1621 [26:54<16:06,  1.45s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 953/1621 [26:56<16:42,  1.50s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 954/1621 [26:58<16:25,  1.48s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 955/1621 [26:59<16:13,  1.46s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 956/1621 [27:00<16:10,  1.46s/it]
0: {'loss': 0.2527, 'grad_norm': 0.3334298807383501, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.59}
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 957/1621 [27:02<16:37,  1.50s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 958/1621 [27:04<16:36,  1.50s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 959/1621 [27:05<17:28,  1.58s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 960/1621 [27:07<17:25,  1.58s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 961/1621 [27:08<16:47,  1.53s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 962/1621 [27:10<16:25,  1.50s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 963/1621 [27:11<16:13,  1.48s/it]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 964/1621 [27:13<16:13,  1.48s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 965/1621 [27:14<16:01,  1.47s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 966/1621 [27:15<15:51,  1.45s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 967/1621 [27:17<15:45,  1.45s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 968/1621 [27:18<15:38,  1.44s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 969/1621 [27:20<15:41,  1.44s/it]
0: {'loss': 0.2671, 'grad_norm': 0.3152370154555651, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.6}
0: [2025-09-02 19:15:19,083] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-975
0: [2025-09-02 19:15:23,850] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: {'loss': 0.2586, 'grad_norm': 0.2761551896866444, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.6}
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 970/1621 [27:21<15:37,  1.44s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 971/1621 [27:23<15:29,  1.43s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 972/1621 [27:24<15:25,  1.43s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 973/1621 [27:25<15:22,  1.42s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 974/1621 [27:27<15:26,  1.43s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 975/1621 [27:28<15:25,  1.43s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 976/1621 [27:39<45:02,  4.19s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 977/1621 [27:40<36:09,  3.37s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 978/1621 [27:42<29:54,  2.79s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 979/1621 [27:43<25:37,  2.39s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 980/1621 [27:45<22:35,  2.11s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 981/1621 [27:46<20:20,  1.91s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 982/1621 [27:48<18:46,  1.76s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 983/1621 [27:49<17:43,  1.67s/it]
0: {'loss': 0.2666, 'grad_norm': 0.3193029520192468, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.61}
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 984/1621 [27:51<17:39,  1.66s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 985/1621 [27:52<16:48,  1.59s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 986/1621 [27:54<16:12,  1.53s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 987/1621 [27:55<15:53,  1.50s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 988/1621 [27:56<15:32,  1.47s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 989/1621 [27:58<16:08,  1.53s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 990/1621 [28:00<16:21,  1.56s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 991/1621 [28:01<15:53,  1.51s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 992/1621 [28:03<15:45,  1.50s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 993/1621 [28:04<15:27,  1.48s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 994/1621 [28:05<15:21,  1.47s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 995/1621 [28:07<15:09,  1.45s/it]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 996/1621 [28:08<15:00,  1.44s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 997/1621 [28:10<15:00,  1.44s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 998/1621 [28:11<15:41,  1.51s/it]
0: {'loss': 0.2576, 'grad_norm': 0.32033329683867157, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.62}
0: {'loss': 0.2554, 'grad_norm': 0.324086242099498, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.62}
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 999/1621 [28:13<15:23,  1.48s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1000/1621 [28:14<15:08,  1.46s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1001/1621 [28:16<14:59,  1.45s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1002/1621 [28:17<14:49,  1.44s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1003/1621 [28:18<14:42,  1.43s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1004/1621 [28:20<15:49,  1.54s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1005/1621 [28:22<15:51,  1.55s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1006/1621 [28:23<15:26,  1.51s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1007/1621 [28:25<15:20,  1.50s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1008/1621 [28:26<15:08,  1.48s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1009/1621 [28:28<14:58,  1.47s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1010/1621 [28:29<14:46,  1.45s/it]
0: {'loss': 0.2586, 'grad_norm': 0.2985933975674141, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.63}
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1011/1621 [28:30<14:37,  1.44s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1012/1621 [28:32<14:59,  1.48s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1013/1621 [28:33<14:49,  1.46s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1014/1621 [28:35<14:49,  1.46s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1015/1621 [28:36<14:39,  1.45s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1016/1621 [28:38<14:32,  1.44s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1017/1621 [28:39<14:25,  1.43s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1018/1621 [28:41<14:34,  1.45s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1019/1621 [28:42<14:56,  1.49s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1020/1621 [28:44<14:43,  1.47s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1021/1621 [28:45<14:32,  1.45s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1022/1621 [28:46<14:31,  1.46s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 1023/1621 [28:48<14:39,  1.47s/it]
 63%|β–ˆβ–ˆβ–ˆβ–ˆοΏ½
0: {'loss': 0.2678, 'grad_norm': 0.3133514681538571, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.64}
0: οΏ½οΏ½β–ˆβ–Ž   | 1024/1621 [28:49<14:33,  1.46s/it]
 63%|██████▎   | 1025/1621 [28:51<14:32,  1.46s/it]
 63%|██████▎   | 1026/1621 [28:52<14:24,  1.45s/it]
 63%|██████▎   | 1027/1621 [28:54<14:16,  1.44s/it]
 63%|██████▎   | 1028/1621 [28:55<14:16,  1.45s/it]
 63%|██████▎   | 1029/1621 [28:57<14:13,  1.44s/it]
 64%|██████▎   | 1030/1621 [28:58<14:04,  1.43s/it]
 64%|██████▎   | 1031/1621 [28:59<14:10,  1.44s/it]
 64%|██████▎   | 1032/1621 [29:01<14:05,  1.44s/it]
 64%|██████▎   | 1033/1621 [29:02<14:04,  1.44s/it]
 64%|██████▍   | 1034/1621 [29:04<14:03,  1.44s/it]
 64%|██████▍   | 1035/1621 [29:05<14:03,  1.44s/it]
 64%|██████▍   | 1036/1621 [29:07<14:11,  1.46s/it]
 64%|██████▍   | 1037/1621 [29:08<14:32,  1.49s/it]
0: {'loss': 0.259, 'grad_norm': 0.29610628515170656, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.64}
 64%|██████▍   | 1038/1621 [29:10<14:23,  1.48s/it]
 64%|██████▍   | 1039/1621 [29:11<14:11,  1.46s/it]
 64%|██████▍   | 1040/1621 [29:13<14:00,  1.45s/it]
 64%|██████▍   | 1041/1621 [29:14<13:53,  1.44s/it]
 64%|██████▍   | 1042/1621 [29:16<14:05,  1.46s/it]
 64%|██████▍   | 1043/1621 [29:17<13:57,  1.45s/it]
 64%|██████▍   | 1044/1621 [29:18<14:04,  1.46s/it]
 64%|██████▍   | 1045/1621 [29:20<14:40,  1.53s/it]
 65%|██████▍   | 1046/1621 [29:22<14:23,  1.50s/it]
 65%|██████▍   | 1047/1621 [29:23<14:32,  1.52s/it]
 65%|██████▍   | 1048/1621 [29:25<14:13,  1.49s/it]
 65%|██████▍   | 1049/1621 [29:26<14:21,  1.51s/it]
 65%|██████▍   | 1050/1621 [29:28<15:13,  1.60s/it]
0: {'loss': 0.2581, 'grad_norm': 0.36909157148741467, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.65}
0: {'loss': 0.2554, 'grad_norm': 0.32556652711144385, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 42.93, 'epoch': 0.65}
 65%|██████▍   | 1051/1621 [29:30<15:45,  1.66s/it]
 65%|██████▍   | 1052/1621 [29:31<15:01,  1.59s/it]
 65%|██████▍   | 1053/1621 [29:33<15:04,  1.59s/it]
 65%|██████▌   | 1054/1621 [29:34<14:34,  1.54s/it]
 65%|██████▌   | 1055/1621 [29:36<14:18,  1.52s/it]
 65%|██████▌   | 1056/1621 [29:37<14:49,  1.57s/it]
 65%|██████▌   | 1057/1621 [29:39<14:41,  1.56s/it]
 65%|██████▌   | 1058/1621 [29:40<14:14,  1.52s/it]
 65%|██████▌   | 1059/1621 [29:42<13:53,  1.48s/it]
 65%|██████▌   | 1060/1621 [29:43<13:45,  1.47s/it]
 65%|██████▌   | 1061/1621 [29:45<13:37,  1.46s/it]
 66%|██████▌   | 1062/1621 [29:46<13:38,  1.46s/it]
 66%|██████▌   | 1063/1621 [29:48<13:44,  1.48s/it]
0: {'loss': 0.2549, 'grad_norm': 0.32018891136312116, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.66}
 66%|██████▌   | 1064/1621 [29:49<14:10,  1.53s/it]
 66%|██████▌   | 1065/1621 [29:51<13:50,  1.49s/it]
 66%|██████▌   | 1066/1621 [29:52<14:10,  1.53s/it]
 66%|██████▌   | 1067/1621 [29:54<13:48,  1.50s/it]
 66%|██████▌   | 1068/1621 [29:55<13:34,  1.47s/it]
 66%|██████▌   | 1069/1621 [29:57<13:34,  1.47s/it]
 66%|██████▌   | 1070/1621 [29:58<13:23,  1.46s/it]
 66%|██████▌   | 1071/1621 [29:59<13:20,  1.46s/it]
 66%|██████▌   | 1072/1621 [30:01<13:12,  1.44s/it]
 66%|██████▌   | 1073/1621 [30:02<13:07,  1.44s/it]
 66%|██████▋   | 1074/1621 [30:04<13:01,  1.43s/it]
 66%|██████▋   | 1075/1621 [30:05<13:48,  1.52s/it]
 66%|██████▋   | 1076/1621 [30:07<13:40,  1.51s/it]
 66%|██████▋   | 1077/1621 [30:08<13:31,  1.49s/it]
0: {'loss': 0.2636, 'grad_norm': 0.3165670706179018, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.67}
 67%|██████▋   | 1078/1621 [30:10<13:19,  1.47s/it]
 67%|██████▋   | 1079/1621 [30:11<13:10,  1.46s/it]
 67%|██████▋   | 1080/1621 [30:13<13:01,  1.44s/it]
 67%|██████▋   | 1081/1621 [30:14<12:55,  1.44s/it]
 67%|██████▋   | 1082/1621 [30:15<12:49,  1.43s/it]
 67%|██████▋   | 1083/1621 [30:17<13:35,  1.52s/it]
 67%|██████▋   | 1084/1621 [30:19<13:55,  1.56s/it]
 67%|██████▋   | 1085/1621 [30:20<13:34,  1.52s/it]
 67%|██████▋   | 1086/1621 [30:22<13:16,  1.49s/it]
 67%|██████▋   | 1087/1621 [30:23<13:42,  1.54s/it]
 67%|██████▋   | 1088/1621 [30:25<13:28,  1.52s/it]
 67%|██████▋   | 1089/1621 [30:26<13:11,  1.49s/it]
 67%|██████▋   | 1090/1621 [30:28<13:34,  1.53s/it]
0: {'loss': 0.2623, 'grad_norm': 0.32402128653847556, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.67}
0: {'loss': 0.2547, 'grad_norm': 0.3385641999618941, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.68}
 67%|██████▋   | 1091/1621 [30:29<13:20,  1.51s/it]
 67%|██████▋   | 1092/1621 [30:31<13:19,  1.51s/it]
 67%|██████▋   | 1093/1621 [30:32<13:03,  1.48s/it]
 67%|██████▋   | 1094/1621 [30:34<12:51,  1.46s/it]
 68%|██████▊   | 1095/1621 [30:35<12:43,  1.45s/it]
 68%|██████▊   | 1096/1621 [30:37<13:12,  1.51s/it]
 68%|██████▊   | 1097/1621 [30:38<13:12,  1.51s/it]
 68%|██████▊   | 1098/1621 [30:40<12:59,  1.49s/it]
 68%|██████▊   | 1099/1621 [30:41<13:02,  1.50s/it]
 68%|██████▊   | 1100/1621 [30:43<12:51,  1.48s/it]
 68%|██████▊   | 1101/1621 [30:44<13:10,  1.52s/it]
 68%|██████▊   | 1102/1621 [30:46<12:59,  1.50s/it]
0: {'loss': 0.2559, 'grad_norm': 0.30780759695128085, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.68}
 68%|██████▊   | 1103/1621 [30:47<12:48,  1.48s/it]
 68%|██████▊   | 1104/1621 [30:49<12:43,  1.48s/it]
 68%|██████▊   | 1105/1621 [30:50<12:32,  1.46s/it]
 68%|██████▊   | 1106/1621 [30:51<12:28,  1.45s/it]
 68%|██████▊   | 1107/1621 [30:53<12:26,  1.45s/it]
 68%|██████▊   | 1108/1621 [30:54<12:29,  1.46s/it]
 68%|██████▊   | 1109/1621 [30:56<12:28,  1.46s/it]
 68%|██████▊   | 1110/1621 [30:57<12:19,  1.45s/it]
 69%|██████▊   | 1111/1621 [30:59<12:26,  1.46s/it]
 69%|██████▊   | 1112/1621 [31:00<12:24,  1.46s/it]
 69%|██████▊   | 1113/1621 [31:02<12:36,  1.49s/it]
 69%|██████▊   | 1114/1621 [31:03<12:34,  1.49s/it]
 69%|██████▉   | 1115/1621 [31:05<13:06,  1.55s/it]
 69%|██████▉   | 1116/1621 [31:06<12:51,  1.53s/it]
0: {'loss': 0.256, 'grad_norm': 0.3035060051707476, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.69}
 69%|██████▉   | 1117/1621 [31:08<12:41,  1.51s/it]
 69%|██████▉   | 1118/1621 [31:09<12:27,  1.49s/it]
 69%|██████▉   | 1119/1621 [31:11<12:19,  1.47s/it]
 69%|██████▉   | 1120/1621 [31:12<12:11,  1.46s/it]
 69%|██████▉   | 1121/1621 [31:14<12:18,  1.48s/it]
 69%|██████▉   | 1122/1621 [31:15<12:12,  1.47s/it]
 69%|██████▉   | 1123/1621 [31:17<12:08,  1.46s/it]
 69%|██████▉   | 1124/1621 [31:18<12:06,  1.46s/it]
 69%|██████▉   | 1125/1621 [31:20<12:06,  1.46s/it]
 69%|██████▉   | 1126/1621 [31:21<11:59,  1.45s/it]
 70%|██████▉   | 1127/1621 [31:22<11:57,  1.45s/it]
 70%|██████▉   | 1128/1621 [31:24<11:59,  1.46s/it]
 70%|██████▉   | 1129/1621 [31:25<12:04,  1.47s/it]
 70%|██████▉   | 1130/1621 [31:27<11:56,  1.46s/it]
0: {'loss': 0.2548, 'grad_norm': 0.30070942936522693, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.7}
0: {'loss': 0.2575, 'grad_norm': 0.31196101931665887, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.7}
 70%|██████▉   | 1131/1621 [31:28<11:48,  1.45s/it]
 70%|██████▉   | 1132/1621 [31:30<11:46,  1.44s/it]
 70%|██████▉   | 1133/1621 [31:31<12:06,  1.49s/it]
 70%|██████▉   | 1134/1621 [31:33<11:55,  1.47s/it]
 70%|███████   | 1135/1621 [31:34<11:49,  1.46s/it]
 70%|███████   | 1136/1621 [31:36<11:41,  1.45s/it]
 70%|███████   | 1137/1621 [31:37<12:35,  1.56s/it]
 70%|███████   | 1138/1621 [31:39<12:14,  1.52s/it]
 70%|███████   | 1139/1621 [31:40<12:05,  1.51s/it]
 70%|███████   | 1140/1621 [31:42<11:53,  1.48s/it]
 70%|███████   | 1141/1621 [31:43<11:49,  1.48s/it]
 70%|███████   | 1142/1621 [31:45<12:06,  1.52s/it]
0: {'loss': 0.2523, 'grad_norm': 0.3316408275634874, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.71}
 71%|███████   | 1143/1621 [31:46<12:37,  1.59s/it]
 71%|███████   | 1144/1621 [31:48<12:16,  1.54s/it]
 71%|███████   | 1145/1621 [31:49<11:56,  1.51s/it]
 71%|███████   | 1146/1621 [31:51<11:46,  1.49s/it]
 71%|███████   | 1147/1621 [31:52<11:41,  1.48s/it]
 71%|███████   | 1148/1621 [31:54<12:11,  1.55s/it]
 71%|███████   | 1149/1621 [31:55<11:52,  1.51s/it]
 71%|███████   | 1150/1621 [31:57<12:05,  1.54s/it]
 71%|███████   | 1151/1621 [31:59<12:14,  1.56s/it]
 71%|███████   | 1152/1621 [32:00<12:02,  1.54s/it]
 71%|███████   | 1153/1621 [32:02<11:55,  1.53s/it]
 71%|███████   | 1154/1621 [32:03<12:03,  1.55s/it]
 71%|███████▏  | 1155/1621 [32:05<11:42,  1.51s/it]
0: {'loss': 0.2534, 'grad_norm': 0.31318013713966153, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.72}
 71%|███████▏  | 1156/1621 [32:06<11:27,  1.48s/it]
 71%|███████▏  | 1157/1621 [32:07<11:24,  1.48s/it]
 71%|███████▏  | 1158/1621 [32:09<11:15,  1.46s/it]
 71%|███████▏  | 1159/1621 [32:10<11:12,  1.46s/it]
 72%|███████▏  | 1160/1621 [32:12<11:08,  1.45s/it]
 72%|███████▏  | 1161/1621 [32:13<11:02,  1.44s/it]
 72%|███████▏  | 1162/1621 [32:15<11:02,  1.44s/it]
 72%|███████▏  | 1163/1621 [32:16<10:56,  1.43s/it]
 72%|███████▏  | 1164/1621 [32:18<10:55,  1.43s/it]
 72%|███████▏  | 1165/1621 [32:19<10:55,  1.44s/it]
 72%|███████▏  | 1166/1621 [32:20<10:51,  1.43s/it]
 72%|███████▏  | 1167/1621 [32:22<10:47,  1.43s/it]
 72%|███████▏  | 1168/1621 [32:23<10:48,  1.43s/it]
 72%|███████▏  | 1169/1621 [32:25<10:48,  1.43s/it]
0: {'loss': 0.2614, 'grad_norm': 0.3167000616112638, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.72}
0: {'loss': 0.2543, 'grad_norm': 0.314470839342128, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.73}
 72%|███████▏  | 1170/1621 [32:26<11:20,  1.51s/it]
 72%|███████▏  | 1171/1621 [32:28<11:14,  1.50s/it]
 72%|███████▏  | 1172/1621 [32:29<11:02,  1.48s/it]
 72%|███████▏  | 1173/1621 [32:31<10:51,  1.46s/it]
 72%|███████▏  | 1174/1621 [32:32<10:48,  1.45s/it]
 72%|███████▏  | 1175/1621 [32:34<10:55,  1.47s/it]
 73%|███████▎  | 1176/1621 [32:35<10:54,  1.47s/it]
 73%|███████▎  | 1177/1621 [32:37<10:48,  1.46s/it]
 73%|███████▎  | 1178/1621 [32:38<10:45,  1.46s/it]
 73%|███████▎  | 1179/1621 [32:40<11:16,  1.53s/it]
 73%|███████▎  | 1180/1621 [32:41<11:05,  1.51s/it]
0: {'loss': 0.2612, 'grad_norm': 0.2997645116358892, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.73}
 73%|███████▎  | 1181/1621 [32:43<10:51,  1.48s/it]
 73%|███████▎  | 1182/1621 [32:44<10:42,  1.46s/it]
 73%|███████▎  | 1183/1621 [32:45<10:35,  1.45s/it]
 73%|███████▎  | 1184/1621 [32:47<10:34,  1.45s/it]
 73%|███████▎  | 1185/1621 [32:48<10:30,  1.45s/it]
 73%|███████▎  | 1186/1621 [32:50<10:27,  1.44s/it]
 73%|███████▎  | 1187/1621 [32:51<10:22,  1.43s/it]
 73%|███████▎  | 1188/1621 [32:53<10:24,  1.44s/it]
 73%|███████▎  | 1189/1621 [32:54<10:20,  1.44s/it]
 73%|███████▎  | 1190/1621 [32:55<10:15,  1.43s/it]
 73%|███████▎  | 1191/1621 [32:57<10:13,  1.43s/it]
 74%|███████▎  | 1192/1621 [32:58<10:37,  1.49s/it]
 74%|███████▎  | 1193/1621 [33:00<10:28,  1.47s/it]
 74%|███████▎  | 1194/1621 [33:01<10:21,  1.46s/it]
0: {'loss': 0.2579, 'grad_norm': 0.3118112550629538, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.74}
 74%|███████▎  | 1195/1621 [33:03<10:15,  1.44s/it]
 74%|███████▍  | 1196/1621 [33:04<10:12,  1.44s/it]
 74%|███████▍  | 1197/1621 [33:06<10:06,  1.43s/it]
 74%|███████▍  | 1198/1621 [33:07<10:06,  1.43s/it]
 74%|███████▍  | 1199/1621 [33:08<10:04,  1.43s/it]
 74%|███████▍  | 1200/1621 [33:10<10:00,  1.43s/it]
 74%|███████▍  | 1201/1621 [33:11<10:01,  1.43s/it]
 74%|███████▍  | 1202/1621 [33:13<09:57,  1.42s/it]
 74%|███████▍  | 1203/1621 [33:14<10:04,  1.45s/it]
 74%|███████▍  | 1204/1621 [33:16<10:05,  1.45s/it]
 74%|███████▍  | 1205/1621 [33:17<10:05,  1.46s/it]
 74%|███████▍  | 1206/1621 [33:19<09:58,  1.44s/it]
 74%|███████▍  | 1207/1621 [33:20<09:53,  1.43s/it]
0: {'loss': 0.2595, 'grad_norm': 0.3115748471949704, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.75}
 75%|███████▍  | 1208/1621 [33:21<09:50,  1.43s/it]
 75%|███████▍  | 1209/1621 [33:23<09:48,  1.43s/it]
 75%|███████▍  | 1210/1621 [33:24<09:43,  1.42s/it]
 75%|███████▍  | 1211/1621 [33:26<09:43,  1.42s/it]
 75%|███████▍  | 1212/1621 [33:27<09:42,  1.42s/it]
 75%|███████▍  | 1213/1621 [33:29<09:41,  1.43s/it]
 75%|███████▍  | 1214/1621 [33:30<09:38,  1.42s/it]
 75%|███████▍  | 1215/1621 [33:31<09:47,  1.45s/it]
 75%|███████▌  | 1216/1621 [33:33<09:43,  1.44s/it]
 75%|███████▌  | 1217/1621 [33:34<09:38,  1.43s/it]
 75%|███████▌  | 1218/1621 [33:36<09:39,  1.44s/it]
 75%|███████▌  | 1219/1621 [33:37<09:35,  1.43s/it]
 75%|███████▌  | 1220/1621 [33:39<09:34,  1.43s/it]
0: {'loss': 0.2622, 'grad_norm': 0.3260517255087193, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.75}
0: {'loss': 0.2566, 'grad_norm': 0.29314960466358514, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.76}
 75%|███████▌  | 1221/1621 [33:40<09:37,  1.44s/it]
 75%|███████▌  | 1222/1621 [33:41<09:32,  1.43s/it]
 75%|███████▌  | 1223/1621 [33:43<09:30,  1.43s/it]
 76%|███████▌  | 1224/1621 [33:44<09:37,  1.46s/it]
 76%|███████▌  | 1225/1621 [33:46<09:30,  1.44s/it]
 76%|███████▌  | 1226/1621 [33:47<09:26,  1.43s/it]
 76%|███████▌  | 1227/1621 [33:49<09:29,  1.45s/it]
 76%|███████▌  | 1228/1621 [33:50<09:24,  1.44s/it]
 76%|███████▌  | 1229/1621 [33:52<09:18,  1.43s/it]
 76%|███████▌  | 1230/1621 [33:53<09:18,  1.43s/it]
 76%|███████▌  | 1231/1621 [33:54<09:20,  1.44s/it]
 76%|███████▌  | 1232/1621 [33:56<09:17,  1.43s/it]
0: {'loss': 0.2569, 'grad_norm': 0.3515763746212918, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.76}
 76%|███████▌  | 1233/1621 [33:57<09:21,  1.45s/it]
 76%|███████▌  | 1234/1621 [33:59<09:20,  1.45s/it]
 76%|███████▌  | 1235/1621 [34:00<09:14,  1.44s/it]
 76%|███████▌  | 1236/1621 [34:02<09:10,  1.43s/it]
 76%|███████▋  | 1237/1621 [34:03<09:27,  1.48s/it]
 76%|███████▋  | 1238/1621 [34:05<09:22,  1.47s/it]
 76%|███████▋  | 1239/1621 [34:06<09:17,  1.46s/it]
 76%|███████▋  | 1240/1621 [34:07<09:11,  1.45s/it]
 77%|███████▋  | 1241/1621 [34:09<09:08,  1.44s/it]
 77%|███████▋  | 1242/1621 [34:10<09:05,  1.44s/it]
 77%|███████▋  | 1243/1621 [34:12<09:03,  1.44s/it]
 77%|███████▋  | 1244/1621 [34:13<09:07,  1.45s/it]
 77%|███████▋  | 1245/1621 [34:15<09:02,  1.44s/it]
0: {'loss': 0.2568, 'grad_norm': 0.318913671719499, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.77}
 77%|███████▋  | 1246/1621 [34:16<09:00,  1.44s/it]
 77%|███████▋  | 1247/1621 [34:18<08:57,  1.44s/it]
 77%|███████▋  | 1248/1621 [34:19<08:55,  1.44s/it]
 77%|███████▋  | 1249/1621 [34:21<09:18,  1.50s/it]
 77%|███████▋  | 1250/1621 [34:22<09:23,  1.52s/it]
 77%|███████▋  | 1251/1621 [34:24<09:28,  1.54s/it]
 77%|███████▋  | 1252/1621 [34:25<09:19,  1.52s/it]
 77%|███████▋  | 1253/1621 [34:27<09:18,  1.52s/it]
 77%|███████▋  | 1254/1621 [34:28<09:18,  1.52s/it]
 77%|███████▋  | 1255/1621 [34:30<09:34,  1.57s/it]
 77%|███████▋  | 1256/1621 [34:31<09:15,  1.52s/it]
 78%|███████▊  | 1257/1621 [34:33<09:02,  1.49s/it]
 78%|███████▊  | 1258/1621 [34:34<08:57,  1.48s/it]
0: {'loss': 0.2585, 'grad_norm': 0.3250021337489215, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.78}
0: {'loss': 0.2579, 'grad_norm': 0.31493837801792507, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.78}
 78%|███████▊  | 1259/1621 [34:36<08:49,  1.46s/it]
 78%|███████▊  | 1260/1621 [34:37<08:46,  1.46s/it]
 78%|███████▊  | 1261/1621 [34:39<08:44,  1.46s/it]
 78%|███████▊  | 1262/1621 [34:40<09:09,  1.53s/it]
 78%|███████▊  | 1263/1621 [34:42<08:57,  1.50s/it]
 78%|███████▊  | 1264/1621 [34:43<08:50,  1.48s/it]
 78%|███████▊  | 1265/1621 [34:45<08:45,  1.48s/it]
 78%|███████▊  | 1266/1621 [34:46<08:37,  1.46s/it]
 78%|███████▊  | 1267/1621 [34:47<08:30,  1.44s/it]
 78%|███████▊  | 1268/1621 [34:49<08:48,  1.50s/it]
 78%|███████▊  | 1269/1621 [34:50<08:37,  1.47s/it]
 78%|███████▊  | 1270/1621 [34:52<08:29,  1.45s/it]
0: {'loss': 0.252, 'grad_norm': 0.30830369578361677, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.79}
 78%|███████▊  | 1271/1621 [34:53<08:42,  1.49s/it]
 78%|███████▊  | 1272/1621 [34:55<08:33,  1.47s/it]
 79%|███████▊  | 1273/1621 [34:56<08:28,  1.46s/it]
 79%|███████▊  | 1274/1621 [34:58<08:40,  1.50s/it]
 79%|███████▊  | 1275/1621 [34:59<08:33,  1.48s/it]
 79%|███████▊  | 1276/1621 [35:01<08:26,  1.47s/it]
 79%|███████▉  | 1277/1621 [35:02<08:35,  1.50s/it]
 79%|███████▉  | 1278/1621 [35:04<08:26,  1.48s/it]
 79%|███████▉  | 1279/1621 [35:05<08:23,  1.47s/it]
 79%|███████▉  | 1280/1621 [35:07<08:24,  1.48s/it]
 79%|███████▉  | 1281/1621 [35:08<08:18,  1.47s/it]
 79%|███████▉  | 1282/1621 [35:10<08:12,  1.45s/it]
 79%|███████▉  | 1283/1621 [35:11<08:07,  1.44s/it]
0: {'loss': 0.2576, 'grad_norm': 0.3198839807651436, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.8}
 79%|███████▉  | 1284/1621 [35:13<08:10,  1.46s/it]
 79%|███████▉  | 1285/1621 [35:14<08:06,  1.45s/it]
 79%|███████▉  | 1286/1621 [35:15<08:08,  1.46s/it]
 79%|███████▉  | 1287/1621 [35:17<08:23,  1.51s/it]
 79%|███████▉  | 1288/1621 [35:18<08:13,  1.48s/it]
 80%|███████▉  | 1289/1621 [35:20<08:10,  1.48s/it]
 80%|███████▉  | 1290/1621 [35:21<08:06,  1.47s/it]
 80%|███████▉  | 1291/1621 [35:23<08:00,  1.46s/it]
 80%|███████▉  | 1292/1621 [35:24<07:56,  1.45s/it]
 80%|███████▉  | 1293/1621 [35:26<07:56,  1.45s/it]
 80%|███████▉  | 1294/1621 [35:27<07:51,  1.44s/it]
 80%|███████▉  | 1295/1621 [35:29<07:52,  1.45s/it]
 80%|███████▉  | 1296/1621 [35:30<07:53,  1.46s/it]
0: {'loss': 0.2577, 'grad_norm': 0.3200484579932923, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.8}
0: [2025-09-02 19:23:26,549] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-1300
0: [2025-09-02 19:23:31,543] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
 80%|████████  | 1297/1621 [35:32<07:52,  1.46s/it]
 80%|████████  | 1298/1621 [35:33<07:49,  1.45s/it]
 80%|████████  | 1299/1621 [35:34<07:43,  1.44s/it]
 80%|████████  | 1300/1621 [35:36<07:42,  1.44s/it]
 80%|████████  | 1301/1621 [35:47<22:41,  4.25s/it]
 80%|████████  | 1302/1621 [35:48<18:06,  3.41s/it]
 80%|████████  | 1303/1621 [35:49<14:53,  2.81s/it]
 80%|████████  | 1304/1621 [35:51<12:58,  2.46s/it]
 81%|████████  | 1305/1621 [35:53<11:20,  2.15s/it]
 81%|████████  | 1306/1621 [35:54<10:11,  1.94s/it]
 81%|████████  | 1307/1621 [35:55<09:20,  1.79s/it]
 81%|████████  | 1308/1621 [35:57<08:43,  1.67s/it]
 81%|████████  | 1309/1621 [35:58<08:21,  1.61s/it]
0: {'loss': 0.2561, 'grad_norm': 0.32112274253173473, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.81}
0: {'loss': 0.2534, 'grad_norm': 0.3283578387479428, 'learning_rate': 5e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.81}
 81%|████████  | 1310/1621 [36:00<08:01,  1.55s/it]
 81%|████████  | 1311/1621 [36:01<07:50,  1.52s/it]
 81%|████████  | 1312/1621 [36:03<07:39,  1.49s/it]
 81%|████████  | 1313/1621 [36:04<08:01,  1.56s/it]
 81%|████████  | 1314/1621 [36:06<08:18,  1.62s/it]
 81%|████████  | 1315/1621 [36:08<07:59,  1.57s/it]
 81%|████████  | 1316/1621 [36:09<07:47,  1.53s/it]
 81%|████████  | 1317/1621 [36:11<07:46,  1.54s/it]
 81%|████████▏ | 1318/1621 [36:12<07:35,  1.50s/it]
 81%|████████▏ | 1319/1621 [36:13<07:28,  1.49s/it]
 81%|████████▏ | 1320/1621 [36:15<07:20,  1.46s/it]
 81%|████████▏ | 1321/1621 [36:16<07:15,  1.45s/it]
0: {'loss': 0.2488, 'grad_norm': 0.30823559386484073, 'learning_rate': 4.9921089333113855e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.82}
 82%|████████▏ | 1322/1621 [36:18<07:11,  1.44s/it]
 82%|████████▏ | 1323/1621 [36:19<07:12,  1.45s/it]
 82%|████████▏ | 1324/1621 [36:21<07:06,  1.44s/it]
 82%|████████▏ | 1325/1621 [36:22<07:13,  1.46s/it]
 82%|████████▏ | 1326/1621 [36:23<07:07,  1.45s/it]
 82%|████████▏ | 1327/1621 [36:25<07:01,  1.43s/it]
 82%|████████▏ | 1328/1621 [36:27<07:29,  1.53s/it]
 82%|████████▏ | 1329/1621 [36:28<07:22,  1.52s/it]
 82%|████████▏ | 1330/1621 [36:30<07:13,  1.49s/it]
 82%|████████▏ | 1331/1621 [36:31<07:07,  1.47s/it]
 82%|████████▏ | 1332/1621 [36:32<07:03,  1.46s/it]
 82%|████████▏ | 1333/1621 [36:34<06:56,  1.45s/it]
 82%|████████▏ | 1334/1621 [36:35<06:53,  1.44s/it]
0: {'loss': 0.2566, 'grad_norm': 0.30367199165447895, 'learning_rate': 4.96014631413955e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.83}
 82%|████████▏ | 1335/1621 [36:37<06:52,  1.44s/it]
 82%|████████▏ | 1336/1621 [36:38<06:47,  1.43s/it]
 82%|████████▏ | 1337/1621 [36:39<06:45,  1.43s/it]
 83%|████████▎ | 1338/1621 [36:41<06:43,  1.43s/it]
 83%|████████▎ | 1339/1621 [36:42<06:40,  1.42s/it]
 83%|████████▎ | 1340/1621 [36:44<06:38,  1.42s/it]
 83%|████████▎ | 1341/1621 [36:45<06:48,  1.46s/it]
 83%|████████▎ | 1342/1621 [36:47<06:43,  1.45s/it]
 83%|████████▎ | 1343/1621 [36:48<06:40,  1.44s/it]
 83%|████████▎ | 1344/1621 [36:50<06:36,  1.43s/it]
 83%|████████▎ | 1345/1621 [36:51<06:40,  1.45s/it]
 83%|████████▎ | 1346/1621 [36:53<06:42,  1.46s/it]
 83%|████████▎ | 1347/1621 [36:54<06:53,  1.51s/it]
0: {'loss': 0.2512, 'grad_norm': 0.2884465188753474, 'learning_rate': 4.903968869447152e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.83}
 83%|████████▎ | 1348/1621 [36:56<06:48,  1.50s/it]
 83%|████████▎ | 1349/1621 [36:57<06:51,  1.51s/it]
 83%|████████▎ | 1350/1621 [36:59<06:45,  1.49s/it]
 83%|████████▎ | 1351/1621 [37:00<06:37,  1.47s/it]
 83%|████████▎ | 1352/1621 [37:01<06:32,  1.46s/it]
 83%|████████▎ | 1353/1621 [37:03<06:27,  1.44s/it]
 84%|████████▎ | 1354/1621 [37:04<06:26,  1.45s/it]
 84%|████████▎ | 1355/1621 [37:06<06:52,  1.55s/it]
 84%|████████▎ | 1356/1621 [37:08<06:39,  1.51s/it]
 84%|████████▎ | 1357/1621 [37:09<07:04,  1.61s/it]
 84%|████████▍ | 1358/1621 [37:11<06:48,  1.55s/it]
 84%|████████▍ | 1359/1621 [37:12<06:40,  1.53s/it]
0: {'loss': 0.2567, 'grad_norm': 0.32993917642338455, 'learning_rate': 4.824192091074126e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.84}
0: {'loss': 0.2595, 'grad_norm': 0.29082902932313515, 'learning_rate': 4.721690030098693e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.85}
 84%|████████▍ | 1360/1621 [37:14<06:31,  1.50s/it]
 84%|████████▍ | 1361/1621 [37:15<06:25,  1.48s/it]
 84%|████████▍ | 1362/1621 [37:17<06:20,  1.47s/it]
 84%|████████▍ | 1363/1621 [37:18<06:18,  1.47s/it]
 84%|████████▍ | 1364/1621 [37:19<06:14,  1.46s/it]
 84%|████████▍ | 1365/1621 [37:21<06:28,  1.52s/it]
 84%|████████▍ | 1366/1621 [37:23<06:17,  1.48s/it]
 84%|████████▍ | 1367/1621 [37:24<06:20,  1.50s/it]
 84%|████████▍ | 1368/1621 [37:26<06:17,  1.49s/it]
 84%|████████▍ | 1369/1621 [37:27<06:34,  1.56s/it]
 85%|████████▍ | 1370/1621 [37:29<06:25,  1.54s/it]
 85%|████████▍ | 1371/1621 [37:30<06:17,  1.51s/it]
0: {'loss': 0.2575, 'grad_norm': 0.30422469733818247, 'learning_rate': 4.5975857205508345e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.85}
 85%|████████▍ | 1372/1621 [37:32<06:13,  1.50s/it]
 85%|████████▍ | 1373/1621 [37:33<06:14,  1.51s/it]
 85%|████████▍ | 1374/1621 [37:35<06:07,  1.49s/it]
 85%|████████▍ | 1375/1621 [37:37<06:55,  1.69s/it]
 85%|████████▍ | 1376/1621 [37:38<06:37,  1.62s/it]
 85%|████████▍ | 1377/1621 [37:40<06:24,  1.58s/it]
 85%|████████▌ | 1378/1621 [37:41<06:11,  1.53s/it]
 85%|████████▌ | 1379/1621 [37:43<06:05,  1.51s/it]
 85%|████████▌ | 1380/1621 [37:44<05:57,  1.48s/it]
 85%|████████▌ | 1381/1621 [37:45<05:53,  1.47s/it]
 85%|████████▌ | 1382/1621 [37:47<05:48,  1.46s/it]
 85%|████████▌ | 1383/1621 [37:48<05:43,  1.44s/it]
0: {'loss': 0.2561, 'grad_norm': 0.319920303484507, 'learning_rate': 4.453238875216452e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.86}
 85%|████████▌ | 1384/1621 [37:50<05:39,  1.43s/it]
 85%|████████▌ | 1385/1621 [37:51<05:39,  1.44s/it]
 86%|████████▌ | 1386/1621 [37:53<05:47,  1.48s/it]
 86%|████████▌ | 1387/1621 [37:54<05:55,  1.52s/it]
 86%|████████▌ | 1388/1621 [37:56<06:00,  1.55s/it]
 86%|████████▌ | 1389/1621 [37:57<05:49,  1.51s/it]
 86%|████████▌ | 1390/1621 [37:59<05:43,  1.49s/it]
 86%|████████▌ | 1391/1621 [38:00<05:43,  1.50s/it]
 86%|████████▌ | 1392/1621 [38:02<05:37,  1.48s/it]
 86%|████████▌ | 1393/1621 [38:03<05:35,  1.47s/it]
 86%|████████▌ | 1394/1621 [38:05<05:45,  1.52s/it]
 86%|████████▌ | 1395/1621 [38:07<05:57,  1.58s/it]
 86%|████████▌ | 1396/1621 [38:08<05:48,  1.55s/it]
0: {'loss': 0.2553, 'grad_norm': 0.3046012760094752, 'learning_rate': 4.29023098833955e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.86}
 86%|████████▌ | 1397/1621 [38:09<05:37,  1.51s/it]
 86%|████████▌ | 1398/1621 [38:11<05:40,  1.53s/it]
 86%|████████▋ | 1399/1621 [38:13<05:39,  1.53s/it]
 86%|████████▋ | 1400/1621 [38:14<05:32,  1.50s/it]
 86%|████████▋ | 1401/1621 [38:15<05:25,  1.48s/it]
 86%|████████▋ | 1402/1621 [38:17<05:35,  1.53s/it]
 87%|████████▋ | 1403/1621 [38:19<05:26,  1.50s/it]
 87%|████████▋ | 1404/1621 [38:20<05:20,  1.48s/it]
 87%|████████▋ | 1405/1621 [38:21<05:15,  1.46s/it]
 87%|████████▋ | 1406/1621 [38:23<05:11,  1.45s/it]
 87%|████████▋ | 1407/1621 [38:24<05:07,  1.44s/it]
 87%|████████▋ | 1408/1621 [38:26<05:10,  1.46s/it]
 87%|████████▋ | 1409/1621 [38:27<05:07,  1.45s/it]
0: {'loss': 0.2563, 'grad_norm': 0.30239637984353285, 'learning_rate': 4.110348008440344e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.87}
0: {'loss': 0.2615, 'grad_norm': 0.30827484115318715, 'learning_rate': 3.915560771089544e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.88}
 87%|████████▋ | 1410/1621 [38:29<05:06,  1.45s/it]
 87%|████████▋ | 1411/1621 [38:30<05:07,  1.47s/it]
 87%|████████▋ | 1412/1621 [38:32<05:03,  1.45s/it]
 87%|████████▋ | 1413/1621 [38:33<05:00,  1.44s/it]
 87%|████████▋ | 1414/1621 [38:35<05:11,  1.51s/it]
 87%|████████▋ | 1415/1621 [38:36<05:15,  1.53s/it]
 87%|████████▋ | 1416/1621 [38:38<05:06,  1.49s/it]
 87%|████████▋ | 1417/1621 [38:39<05:02,  1.48s/it]
 87%|████████▋ | 1418/1621 [38:40<04:57,  1.46s/it]
 88%|████████▊ | 1419/1621 [38:42<04:54,  1.46s/it]
 88%|████████▊ | 1420/1621 [38:43<04:50,  1.45s/it]
0: {'loss': 0.2523, 'grad_norm': 0.2940110425585398, 'learning_rate': 3.7080034060214136e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.88}
 88%|████████▊ | 1421/1621 [38:45<05:00,  1.50s/it]
 88%|████████▊ | 1422/1621 [38:46<04:53,  1.48s/it]
 88%|████████▊ | 1423/1621 [38:48<04:48,  1.46s/it]
 88%|████████▊ | 1424/1621 [38:49<04:45,  1.45s/it]
 88%|████████▊ | 1425/1621 [38:51<04:42,  1.44s/it]
 88%|████████▊ | 1426/1621 [38:52<04:39,  1.43s/it]
 88%|████████▊ | 1427/1621 [38:54<04:41,  1.45s/it]
 88%|████████▊ | 1428/1621 [38:55<04:39,  1.45s/it]
 88%|████████▊ | 1429/1621 [38:56<04:36,  1.44s/it]
 88%|████████▊ | 1430/1621 [38:58<04:43,  1.49s/it]
 88%|████████▊ | 1431/1621 [39:00<04:41,  1.48s/it]
 88%|████████▊ | 1432/1621 [39:01<04:36,  1.46s/it]
 88%|████████▊ | 1433/1621 [39:02<04:32,  1.45s/it]
0: {'loss': 0.2507, 'grad_norm': 0.3055514932413729, 'learning_rate': 3.489949955161813e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.89}
 88%|████████▊ | 1434/1621 [39:04<04:30,  1.45s/it]
 89%|████████▊ | 1435/1621 [39:05<04:34,  1.47s/it]
 89%|████████▊ | 1436/1621 [39:07<04:30,  1.46s/it]
 89%|████████▊ | 1437/1621 [39:08<04:28,  1.46s/it]
 89%|████████▊ | 1438/1621 [39:10<04:28,  1.46s/it]
 89%|████████▉ | 1439/1621 [39:11<04:23,  1.45s/it]
 89%|████████▉ | 1440/1621 [39:13<04:39,  1.54s/it]
 89%|████████▉ | 1441/1621 [39:14<04:32,  1.51s/it]
 89%|████████▉ | 1442/1621 [39:16<04:26,  1.49s/it]
 89%|████████▉ | 1443/1621 [39:17<04:21,  1.47s/it]
 89%|████████▉ | 1444/1621 [39:19<04:17,  1.45s/it]
 89%|████████▉ | 1445/1621 [39:20<04:14,  1.45s/it]
 89%|████████▉ | 1446/1621 [39:21<04:12,  1.44s/it]
0: {'loss': 0.2592, 'grad_norm': 0.308632847742623, 'learning_rate': 3.263789457748976e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.89}
 89%|████████▉ | 1447/1621 [39:23<04:12,  1.45s/it]
 89%|████████▉ | 1448/1621 [39:24<04:10,  1.45s/it]
 89%|████████▉ | 1449/1621 [39:26<04:11,  1.46s/it]
 89%|████████▉ | 1450/1621 [39:27<04:14,  1.49s/it]
 90%|████████▉ | 1451/1621 [39:29<04:09,  1.47s/it]
 90%|████████▉ | 1452/1621 [39:30<04:06,  1.46s/it]
 90%|████████▉ | 1453/1621 [39:32<04:14,  1.52s/it]
 90%|████████▉ | 1454/1621 [39:34<04:20,  1.56s/it]
 90%|████████▉ | 1455/1621 [39:35<04:12,  1.52s/it]
 90%|████████▉ | 1456/1621 [39:36<04:06,  1.49s/it]
 90%|████████▉ | 1457/1621 [39:38<04:02,  1.48s/it]
 90%|████████▉ | 1458/1621 [39:39<04:03,  1.49s/it]
 90%|█████████ | 1459/1621 [39:41<03:57,  1.47s/it]
0: {'loss': 0.2494, 'grad_norm': 0.2790640815194346, 'learning_rate': 3.031999775519685e-06, 'memory/max_mem_active(gib)': 35.94, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.9}
0: {'loss': 0.2612, 'grad_norm': 0.32053738639864715, 'learning_rate': 2.7971204447375534e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.91}
 90%|█████████ | 1460/1621 [39:42<03:54,  1.46s/it]
 90%|█████████ | 1461/1621 [39:44<03:52,  1.45s/it]
 90%|█████████ | 1462/1621 [39:45<03:50,  1.45s/it]
 90%|█████████ | 1463/1621 [39:47<03:47,  1.44s/it]
 90%|█████████ | 1464/1621 [39:48<03:54,  1.49s/it]
 90%|█████████ | 1465/1621 [39:50<04:03,  1.56s/it]
 90%|█████████ | 1466/1621 [39:52<04:06,  1.59s/it]
 90%|█████████ | 1467/1621 [39:53<03:56,  1.54s/it]
 91%|█████████ | 1468/1621 [39:55<03:57,  1.55s/it]
 91%|█████████ | 1469/1621 [39:56<03:50,  1.51s/it]
 91%|█████████ | 1470/1621 [39:57<03:44,  1.48s/it]
0: {'loss': 0.2583, 'grad_norm': 0.30047512075465854, 'learning_rate': 2.561724852502291e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.91}
 91%|█████████ | 1471/1621 [39:59<03:49,  1.53s/it]
 91%|█████████ | 1472/1621 [40:00<03:45,  1.51s/it]
 91%|█████████ | 1473/1621 [40:02<03:40,  1.49s/it]
 91%|█████████ | 1474/1621 [40:03<03:35,  1.47s/it]
 91%|█████████ | 1475/1621 [40:05<03:31,  1.45s/it]
 91%|█████████ | 1476/1621 [40:06<03:30,  1.45s/it]
 91%|█████████ | 1477/1621 [40:08<03:28,  1.45s/it]
 91%|█████████ | 1478/1621 [40:09<03:25,  1.44s/it]
 91%|█████████ | 1479/1621 [40:10<03:23,  1.43s/it]
 91%|█████████▏| 1480/1621 [40:12<03:21,  1.43s/it]
 91%|█████████▏| 1481/1621 [40:13<03:23,  1.45s/it]
 91%|█████████▏| 1482/1621 [40:15<03:22,  1.45s/it]
 91%|█████████▏| 1483/1621 [40:16<03:19,  1.44s/it]
0: {'loss': 0.2514, 'grad_norm': 0.2898197692202884, 'learning_rate': 2.3283920421821194e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.92}
 92%|█████████▏| 1484/1621 [40:18<03:19,  1.46s/it]
 92%|█████████▏| 1485/1621 [40:19<03:19,  1.47s/it]
 92%|█████████▏| 1486/1621 [40:21<03:16,  1.45s/it]
 92%|█████████▏| 1487/1621 [40:22<03:27,  1.55s/it]
 92%|█████████▏| 1488/1621 [40:24<03:21,  1.51s/it]
 92%|█████████▏| 1489/1621 [40:25<03:15,  1.48s/it]
 92%|█████████▏| 1490/1621 [40:27<03:15,  1.49s/it]
 92%|█████████▏| 1491/1621 [40:28<03:11,  1.47s/it]
 92%|█████████▏| 1492/1621 [40:30<03:08,  1.46s/it]
 92%|█████████▏| 1493/1621 [40:31<03:06,  1.46s/it]
 92%|█████████▏| 1494/1621 [40:33<03:03,  1.44s/it]
 92%|█████████▏| 1495/1621 [40:34<03:07,  1.49s/it]
0: {'loss': 0.2493, 'grad_norm': 0.2788921768117285, 'learning_rate': 2.099678456874939e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.93}
 92%|█████████▏| 1496/1621 [40:36<03:03,  1.47s/it]
 92%|█████████▏| 1497/1621 [40:37<03:02,  1.47s/it]
 92%|█████████▏| 1498/1621 [40:38<02:59,  1.46s/it]
 92%|█████████▏| 1499/1621 [40:40<02:56,  1.44s/it]
 93%|█████████▎| 1500/1621 [40:41<02:54,  1.44s/it]
 93%|█████████▎| 1501/1621 [40:43<02:52,  1.44s/it]
 93%|█████████▎| 1502/1621 [40:44<02:50,  1.43s/it]
 93%|█████████▎| 1503/1621 [40:46<02:48,  1.43s/it]
 93%|█████████▎| 1504/1621 [40:47<02:52,  1.47s/it]
 93%|█████████▎| 1505/1621 [40:49<02:52,  1.48s/it]
 93%|█████████▎| 1506/1621 [40:50<02:48,  1.47s/it]
 93%|█████████▎| 1507/1621 [40:51<02:45,  1.45s/it]
 93%|█████████▎| 1508/1621 [40:53<02:42,  1.44s/it]
0: {'loss': 0.2549, 'grad_norm': 0.292809089573132, 'learning_rate': 1.8780899304827687e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.93}
 93%|█████████▎| 1509/1621 [40:55<02:49,  1.51s/it]
 93%|█████████▎| 1510/1621 [40:56<02:45,  1.49s/it]
 93%|█████████▎| 1511/1621 [40:57<02:43,  1.48s/it]
 93%|█████████▎| 1512/1621 [40:59<02:41,  1.48s/it]
 93%|█████████▎| 1513/1621 [41:00<02:37,  1.46s/it]
 93%|█████████▎| 1514/1621 [41:02<02:35,  1.45s/it]
 93%|█████████▎| 1515/1621 [41:03<02:35,  1.47s/it]
 94%|█████████▎| 1516/1621 [41:05<02:33,  1.46s/it]
 94%|█████████▎| 1517/1621 [41:06<02:31,  1.45s/it]
 94%|█████████▎| 1518/1621 [41:08<02:28,  1.44s/it]
 94%|█████████▎| 1519/1621 [41:09<02:26,  1.44s/it]
 94%|█████████▍| 1520/1621 [41:11<02:26,  1.45s/it]
0: {'loss': 0.2564, 'grad_norm': 0.3019723689156526, 'learning_rate': 1.6660542332711405e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.94}
0: {'loss': 0.2615, 'grad_norm': 0.28458470847453754, 'learning_rate': 1.465894472710029e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.94}
 94%|█████████▍| 1521/1621 [41:12<02:26,  1.47s/it]
 94%|█████████▍| 1522/1621 [41:14<02:37,  1.59s/it]
 94%|█████████▍| 1523/1621 [41:15<02:31,  1.55s/it]
 94%|█████████▍| 1524/1621 [41:17<02:29,  1.54s/it]
 94%|█████████▍| 1525/1621 [41:18<02:24,  1.50s/it]
 94%|█████████▍| 1526/1621 [41:20<02:20,  1.47s/it]
 94%|█████████▍| 1527/1621 [41:21<02:17,  1.46s/it]
 94%|█████████▍| 1528/1621 [41:23<02:14,  1.44s/it]
 94%|█████████▍| 1529/1621 [41:24<02:12,  1.44s/it]
 94%|█████████▍| 1530/1621 [41:25<02:10,  1.44s/it]
 94%|█████████▍| 1531/1621 [41:27<02:11,  1.46s/it]
0: {'loss': 0.2471, 'grad_norm': 0.28628626975937854, 'learning_rate': 1.2798036410222628e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.95}
 95%|█████████▍| 1532/1621 [41:28<02:08,  1.45s/it]
 95%|█████████▍| 1533/1621 [41:30<02:10,  1.49s/it]
 95%|█████████▍| 1534/1621 [41:31<02:08,  1.47s/it]
 95%|█████████▍| 1535/1621 [41:33<02:10,  1.51s/it]
 95%|█████████▍| 1536/1621 [41:34<02:06,  1.48s/it]
 95%|█████████▍| 1537/1621 [41:36<02:13,  1.59s/it]
 95%|█████████▍| 1538/1621 [41:38<02:14,  1.62s/it]
 95%|█████████▍| 1539/1621 [41:39<02:08,  1.56s/it]
 95%|█████████▌| 1540/1621 [41:41<02:04,  1.53s/it]
 95%|█████████▌| 1541/1621 [41:42<02:00,  1.51s/it]
 95%|█████████▌| 1542/1621 [41:44<01:56,  1.48s/it]
 95%|█████████▌| 1543/1621 [41:45<01:53,  1.46s/it]
0: {'loss': 0.2498, 'grad_norm': 0.2849557358168061, 'learning_rate': 1.1098205883018246e-06, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.96}
 95%|█████████▌| 1544/1621 [41:46<01:51,  1.44s/it]
 95%|█████████▌| 1545/1621 [41:48<01:49,  1.44s/it]
 95%|█████████▌| 1546/1621 [41:49<01:47,  1.44s/it]
 95%|█████████▌| 1547/1621 [41:51<01:46,  1.43s/it]
 95%|█████████▌| 1548/1621 [41:52<01:44,  1.43s/it]
 96%|█████████▌| 1549/1621 [41:54<01:42,  1.43s/it]
 96%|█████████▌| 1550/1621 [41:55<01:42,  1.44s/it]
 96%|█████████▌| 1551/1621 [41:57<01:40,  1.43s/it]
 96%|█████████▌| 1552/1621 [41:58<01:42,  1.49s/it]
 96%|█████████▌| 1553/1621 [42:00<01:40,  1.47s/it]
 96%|█████████▌| 1554/1621 [42:01<01:38,  1.47s/it]
 96%|█████████▌| 1555/1621 [42:02<01:36,  1.47s/it]
 96%|█████████▌| 1556/1621 [42:04<01:34,  1.46s/it]
0: {'loss': 0.2488, 'grad_norm': 0.28016585021963597, 'learning_rate': 9.578076844455587e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.96}
 96%|█████████▌| 1557/1621 [42:05<01:32,  1.44s/it]
 96%|█████████▌| 1558/1621 [42:07<01:30,  1.44s/it]
 96%|█████████▌| 1559/1621 [42:08<01:31,  1.48s/it]
 96%|█████████▌| 1560/1621 [42:10<01:30,  1.48s/it]
 96%|█████████▋| 1561/1621 [42:11<01:27,  1.46s/it]
 96%|█████████▋| 1562/1621 [42:13<01:25,  1.44s/it]
 96%|█████████▋| 1563/1621 [42:14<01:24,  1.45s/it]
 96%|█████████▋| 1564/1621 [42:16<01:22,  1.45s/it]
 97%|█████████▋| 1565/1621 [42:17<01:20,  1.44s/it]
 97%|█████████▋| 1566/1621 [42:18<01:19,  1.44s/it]
 97%|█████████▋| 1567/1621 [42:20<01:17,  1.43s/it]
 97%|█████████▋| 1568/1621 [42:21<01:16,  1.44s/it]
0: {'loss': 0.2597, 'grad_norm': 0.2927629804951191, 'learning_rate': 8.254304146388603e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.97}
 97%|█████████▋| 1569/1621 [42:23<01:15,  1.46s/it]
 97%|█████████▋| 1570/1621 [42:24<01:16,  1.50s/it]
 97%|█████████▋| 1571/1621 [42:26<01:13,  1.48s/it]
 97%|█████████▋| 1572/1621 [42:27<01:13,  1.50s/it]
 97%|█████████▋| 1573/1621 [42:29<01:11,  1.49s/it]
 97%|█████████▋| 1574/1621 [42:30<01:09,  1.48s/it]
 97%|█████████▋| 1575/1621 [42:32<01:07,  1.46s/it]
 97%|█████████▋| 1576/1621 [42:33<01:05,  1.46s/it]
 97%|█████████▋| 1577/1621 [42:35<01:04,  1.45s/it]
 97%|█████████▋| 1578/1621 [42:36<01:02,  1.45s/it]
 97%|█████████▋| 1579/1621 [42:37<01:00,  1.44s/it]
 97%|█████████▋| 1580/1621 [42:39<00:58,  1.44s/it]
0: {'loss': 0.2517, 'grad_norm': 0.2752181427419336, 'learning_rate': 7.141391319514565e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.97}
0: {'loss': 0.2526, 'grad_norm': 0.2740685329096038, 'learning_rate': 6.251531669656679e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.98}
 98%|█████████▊| 1581/1621 [42:40<00:58,  1.46s/it]
 98%|█████████▊| 1582/1621 [42:42<00:56,  1.44s/it]
 98%|█████████▊| 1583/1621 [42:43<00:54,  1.44s/it]
 98%|█████████▊| 1584/1621 [42:45<00:53,  1.44s/it]
 98%|█████████▊| 1585/1621 [42:46<00:51,  1.44s/it]
 98%|█████████▊| 1586/1621 [42:48<00:50,  1.43s/it]
 98%|█████████▊| 1587/1621 [42:49<00:49,  1.44s/it]
 98%|█████████▊| 1588/1621 [42:51<00:48,  1.48s/it]
 98%|█████████▊| 1589/1621 [42:52<00:46,  1.46s/it]
 98%|█████████▊| 1590/1621 [42:53<00:44,  1.45s/it]
 98%|█████████▊| 1591/1621 [42:55<00:43,  1.45s/it]
 98%|█████████▊| 1592/1621 [42:56<00:41,  1.44s/it]
0: {'loss': 0.2544, 'grad_norm': 0.29567761780713164, 'learning_rate': 5.594474685353894e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.99}
 98%|█████████▊| 1593/1621 [42:58<00:41,  1.47s/it]
 98%|█████████▊| 1594/1621 [43:00<00:41,  1.55s/it]
 98%|█████████▊| 1595/1621 [43:01<00:39,  1.50s/it]
 98%|█████████▊| 1596/1621 [43:02<00:37,  1.50s/it]
 99%|█████████▊| 1597/1621 [43:04<00:35,  1.48s/it]
 99%|█████████▊| 1598/1621 [43:05<00:33,  1.46s/it]
 99%|█████████▊| 1599/1621 [43:07<00:32,  1.46s/it]
 99%|█████████▊| 1600/1621 [43:08<00:30,  1.45s/it]
 99%|█████████▉| 1601/1621 [43:10<00:29,  1.47s/it]
 99%|█████████▉| 1602/1621 [43:11<00:27,  1.45s/it]
 99%|█████████▉| 1603/1621 [43:13<00:26,  1.48s/it]
 99%|█████████▉| 1604/1621 [43:14<00:24,  1.47s/it]
0: {'loss': 0.2525, 'grad_norm': 0.28413860743448904, 'learning_rate': 5.177419220424251e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 0.99}
 99%|█████████▉| 1605/1621 [43:16<00:24,  1.52s/it]
 99%|█████████▉| 1606/1621 [43:17<00:22,  1.50s/it]
 99%|█████████▉| 1607/1621 [43:19<00:20,  1.50s/it]
 99%|█████████▉| 1608/1621 [43:20<00:19,  1.47s/it]
 99%|█████████▉| 1609/1621 [43:22<00:18,  1.51s/it]
 99%|█████████▉| 1610/1621 [43:23<00:16,  1.49s/it]
 99%|█████████▉| 1611/1621 [43:25<00:14,  1.47s/it]
 99%|█████████▉| 1612/1621 [43:26<00:13,  1.46s/it]
100%|█████████▉| 1613/1621 [43:27<00:11,  1.47s/it]
100%|█████████▉| 1614/1621 [43:29<00:10,  1.45s/it]
100%|█████████▉| 1615/1621 [43:30<00:08,  1.45s/it]
100%|█████████▉| 1616/1621 [43:32<00:07,  1.44s/it]
100%|█████████▉| 1617/1621 [43:33<00:05,  1.44s/it]
0: {'loss': 0.2512, 'grad_norm': 0.26094743319216807, 'learning_rate': 5.004934621815976e-07, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 1.0}
0: [2025-09-02 19:31:38,125] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0/checkpoint-1621
0: [2025-09-02 19:31:42,994] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: {'train_runtime': 2637.1175, 'train_samples_per_second': 9.835, 'train_steps_per_second': 0.615, 'train_loss': 0.26498376052737016, 'memory/max_mem_active(gib)': 35.96, 'memory/max_mem_allocated(gib)': 35.16, 'memory/device_mem_reserved(gib)': 43.51, 'epoch': 1.0}
100%|█████████▉| 1618/1621 [43:35<00:04,  1.44s/it]
100%|█████████▉| 1619/1621 [43:36<00:02,  1.44s/it]
100%|█████████▉| 1620/1621 [43:38<00:01,  1.45s/it]
100%|██████████| 1621/1621 [43:48<00:00,  4.02s/it]
100%|██████████| 1621/1621 [43:57<00:00,  4.02s/it]
100%|██████████| 1621/1621 [43:57<00:00,  1.63s/it]
0: [2025-09-02 19:31:45,684] [INFO] [axolotl.train.save_trained_model:228] [PID:1478787] [RANK:0] Training completed! Saving trained model to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0.
0: [2025-09-02 19:31:47,159] [INFO] [axolotl.core.trainers.base._save:613] [PID:1478787] [RANK:0] Saving model checkpoint to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0
0: [2025-09-02 19:31:51,888] [INFO] [axolotl.core.trainers.base._save:662] [PID:1478787] [RANK:0] Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
0: [2025-09-02 19:31:52,303] [INFO] [axolotl.train.save_trained_model:350] [PID:1478787] [RANK:0] Model successfully saved to /lustre/fswork/projects/rech/dgo/udv55np/math/Qwen3-235B-A22B/Qwen2.5-3B_ift/0
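
The rank-prefixed metric lines in this log are Python dict literals, so they can be pulled out and sanity-checked with the standard library. The sketch below is only an illustration: `parse_metric_line` is a hypothetical helper, not part of axolotl or Transformers, and the summary line is abbreviated from the one logged above. The step count 1621 is taken from the progress bar. The ratio of samples per second to steps per second suggests about 16 samples per optimizer step, although the batch configuration itself does not appear in this log.

```python
import ast
from typing import Optional


def parse_metric_line(line: str) -> Optional[dict]:
    """Return the metrics dict from a line like "0: {...}", or None.

    Hypothetical helper: assumes the "<rank>: " prefix seen in this log.
    """
    _, sep, payload = line.partition(": ")
    payload = payload.strip()
    if not sep or not payload.startswith("{"):
        return None  # progress bars and [INFO] lines are skipped
    try:
        parsed = ast.literal_eval(payload)
    except (ValueError, SyntaxError):
        return None  # truncated or otherwise malformed dict
    return parsed if isinstance(parsed, dict) else None


# Abbreviated copy of the final summary line from the log above.
summary_line = (
    "0: {'train_runtime': 2637.1175, 'train_samples_per_second': 9.835, "
    "'train_steps_per_second': 0.615, 'train_loss': 0.26498376052737016, "
    "'epoch': 1.0}"
)
summary = parse_metric_line(summary_line)

total_steps = 1621  # from the progress bar
steps_per_second = total_steps / summary["train_runtime"]
samples_per_step = (
    summary["train_samples_per_second"] / summary["train_steps_per_second"]
)

print(f"steps/s recomputed from runtime: {steps_per_second:.3f}")
print(f"approx. samples per step: {samples_per_step:.1f}")
```

Applying the same helper to every line of the log and keeping the non-`None` results gives the loss and learning-rate series without the progress-bar noise.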