diff --git "a/20250514_214337.log" "b/20250514_214337.log" new file mode 100644--- /dev/null +++ "b/20250514_214337.log" @@ -0,0 +1,126 @@ +[2025-05-14 21:43:37] Created output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 +[2025-05-14 21:43:37] Chat mode disabled +[2025-05-14 21:43:37] Model size is 3B or smaller (1 B). Using full fine-tuning. +[2025-05-14 21:43:37] No QA format data will be used +[2025-05-14 21:43:37] Limiting dataset size to: 100 samples +[2025-05-14 21:43:37] ======================================= +[2025-05-14 21:43:37] Starting training for model: google/gemma-3-1b-pt +[2025-05-14 21:43:37] ======================================= +[2025-05-14 21:43:37] CUDA_VISIBLE_DEVICES: 0,1,2,3 +[2025-05-14 21:43:37] WANDB_PROJECT: wikidyk-ar +[2025-05-14 21:43:37] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2_trainqas.json +[2025-05-14 21:43:37] Global Batch Size: 128 +[2025-05-14 21:43:37] Data Size: 100 +[2025-05-14 21:43:37] Executing command: torchrun --nproc_per_node "4" --master-port 29581 src/train.py --model_name_or_path "google/gemma-3-1b-pt" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2_trainqas.json" --output_dir "train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-5" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy steps --save_steps 10000 --save_total_limit 3 --resume_from_checkpoint True --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false" --ds_size 100 +[2025-05-14 21:43:37] Training started at Wed May 14 21:43:37 UTC 2025 +W0514 21:43:38.845000 618618 site-packages/torch/distributed/run.py:792] +W0514 21:43:38.845000 618618 site-packages/torch/distributed/run.py:792] ***************************************** +W0514 
21:43:38.845000 618618 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0514 21:43:38.845000 618618 site-packages/torch/distributed/run.py:792] ***************************************** +WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 +The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead. +You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. +WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 +The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead. +You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. +WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 +WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 +The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead. +You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. +WARNING:root:Loading data... +The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. 
Please use `attn_implementation="flash_attention_2"` instead. +You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. +WARNING:root:Loading data... +WARNING:root:Loading data... +WARNING:root:Loading data... +WARNING:root:Dataset initialized with all QA data: +WARNING:root: - 100000 QA examples +WARNING:root: - 100 fact examples with upsampling factor 1000 +WARNING:root: - Total examples: 200000 +/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. + trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module) +WARNING:root:Dataset initialized with all QA data: +WARNING:root: - 100000 QA examples +WARNING:root: - 100 fact examples with upsampling factor 1000 +WARNING:root: - Total examples: 200000 +/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. + trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module) +WARNING:root:Dataset initialized with all QA data: +WARNING:root: - 100000 QA examples +WARNING:root: - 100 fact examples with upsampling factor 1000 +WARNING:root: - Total examples: 200000 +/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. 
+ trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module) +WARNING:root:Dataset initialized with all QA data: +WARNING:root: - 100000 QA examples +WARNING:root: - 100 fact examples with upsampling factor 1000 +WARNING:root: - Total examples: 200000 +/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. + trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module) +Checkpoint missing; starting training from scratch +Checkpoint missing; starting training from scratch +Checkpoint missing; starting training from scratch +Checkpoint missing; starting training from scratch +wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter. +It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('', attn_implementation='eager')`. +It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('', attn_implementation='eager')`. +It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('', attn_implementation='eager')`. +wandb: Currently logged in as: yuweiz to https://api.wandb.ai. Use `wandb login --relogin` to force relogin +wandb: Tracking run with wandb version 0.19.11 +wandb: Run data is saved locally in /root/yuwei/WikiDYKEvalV2/wandb/run-20250514_214351-thkr8ndb +wandb: Run `wandb offline` to turn off syncing. 
+wandb: Syncing run train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 +wandb: ⭐️ View project at https://wandb.ai/yuweiz/wikidyk-ar +wandb: 🚀 View run at https://wandb.ai/yuweiz/wikidyk-ar/runs/thkr8ndb + 0%| | 0/1563 [00:00', attn_implementation='eager')`. +[rank2]:[W514 21:43:53.328500884 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank1]:[W514 21:43:53.333029675 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank3]:[W514 21:43:53.336456929 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) +[rank0]:[W514 21:43:53.339178719 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) + 0%| | 1/1563 [00:02<58:53, 2.26s/it] 0%| | 2/1563 [00:02<31:34, 1.21s/it] 0%| | 3/1563 [00:03<25:28, 1.02it/s] 0%| | 4/1563 [00:04<29:51, 1.15s/it] 0%| | 5/1563 [00:05<23:51, 1.09it/s] 0%| | 6/1563 [00:06<23:17, 1.11it/s] 0%| | 7/1563 [00:06<21:42, 1.19it/s] 1%| | 8/1563 [00:07<22:32, 1.15it/s] 1%| | 9/1563 [00:08<21:40, 1.19it/s] 1%| | 10/1563 [00:09<18:39, 1.39it/s] 1%| | 11/1563 [00:09<16:43, 1.55it/s] 1%| | 12/1563 [00:10<18:18, 1.41it/s] 1%| | 13/1563 [00:10<17:12, 1.50it/s] 1%| | 14/1563 [00:11<17:28, 1.48it/s] 1%| | 15/1563 [00:12<17:57, 1.44it/s] 1%| | 16/1563 [00:13<17:49, 1.45it/s] 1%| | 17/1563 [00:13<17:00, 1.52it/s] 1%| | 18/1563 [00:14<15:29, 1.66it/s] 1%| | 19/1563 [00:14<17:11, 1.50it/s] 1%|▏ | 20/1563 [00:15<16:15, 1.58it/s] 1%|▏ | 21/1563 [00:16<16:34, 1.55it/s] 1%|▏ | 22/1563 [00:17<22:29, 1.14it/s] 1%|▏ | 23/1563 [00:18<20:43, 1.24it/s] 2%|▏ | 24/1563 [00:18<18:00, 1.42it/s] 2%|▏ | 25/1563 [00:19<17:10, 1.49it/s] 2%|▏ | 26/1563 [00:19<15:25, 1.66it/s] 2%|▏ | 27/1563 [00:20<17:09, 1.49it/s] 2%|▏ | 28/1563 [00:21<17:14, 1.48it/s] 2%|▏ | 29/1563 [00:21<16:29, 1.55it/s] 2%|▏ | 30/1563 [00:22<15:06, 1.69it/s] 2%|▏ | 31/1563 [00:22<14:10, 1.80it/s] 2%|▏ | 32/1563 [00:23<16:17, 1.57it/s] 2%|▏ | 33/1563 [00:24<20:09, 1.26it/s] 2%|▏ | 34/1563 [00:25<17:39, 1.44it/s] 2%|▏ | 35/1563 [00:25<15:57, 1.60it/s] 2%|▏ | 36/1563 [00:26<15:51, 1.60it/s] 2%|▏ | 37/1563 [00:26<15:29, 1.64it/s] 
2%|▏ | 38/1563 [00:27<16:19, 1.56it/s] 2%|▏ | 39/1563 [00:28<16:26, 1.54it/s] 3%|▎ | 40/1563 [00:28<16:19, 1.56it/s] 3%|▎ | 41/1563 [00:29<15:50, 1.60it/s] 3%|▎ | 42/1563 [00:30<16:59, 1.49it/s] 3%|▎ | 43/1563 [00:30<17:21, 1.46it/s] 3%|▎ | 44/1563 [00:31<18:16, 1.38it/s] 3%|▎ | 45/1563 [00:32<17:32, 1.44it/s] 3%|▎ | 46/1563 [00:32<15:54, 1.59it/s] 3%|▎ | 47/1563 [00:33<15:21, 1.65it/s] 3%|▎ | 48/1563 [00:33<14:43, 1.71it/s] 3%|▎ | 49/1563 [00:34<15:37, 1.62it/s] 3%|▎ | 50/1563 [00:35<16:50, 1.50it/s] {'loss': 3.5671, 'grad_norm': 996.0, 'learning_rate': 1.9373000639795267e-05, 'epoch': 0.03} + 3%|▎ | 50/1563 [00:35<16:50, 1.50it/s] 3%|▎ | 51/1563 [00:36<17:52, 1.41it/s] 3%|▎ | 52/1563 [00:36<16:09, 1.56it/s] 3%|▎ | 53/1563 [00:37<17:24, 1.45it/s] 3%|▎ | 54/1563 [00:38<18:36, 1.35it/s] 4%|▎ | 55/1563 [00:39<19:03, 1.32it/s] 4%|▎ | 56/1563 [00:39<17:03, 1.47it/s] 4%|▎ | 57/1563 [00:41<21:41, 1.16it/s] 4%|▎ | 58/1563 [00:41<20:19, 1.23it/s] 4%|▍ | 59/1563 [00:42<17:54, 1.40it/s] 4%|▍ | 60/1563 [00:42<16:05, 1.56it/s] 4%|▍ | 61/1563 [00:43<17:34, 1.42it/s] 4%|▍ | 62/1563 [00:44<18:03, 1.38it/s] 4%|▍ | 63/1563 [00:44<17:18, 1.44it/s] 4%|▍ | 64/1563 [00:45<17:29, 1.43it/s] 4%|▍ | 65/1563 [00:46<16:14, 1.54it/s] 4%|▍ | 66/1563 [00:46<16:43, 1.49it/s] 4%|▍ | 67/1563 [00:47<15:01, 1.66it/s] 4%|▍ | 68/1563 [00:47<14:43, 1.69it/s] 4%|▍ | 69/1563 [00:48<13:42, 1.82it/s] 4%|▍ | 70/1563 [00:48<14:28, 1.72it/s] 5%|▍ | 71/1563 [00:49<16:06, 1.54it/s] 5%|▍ | 72/1563 [00:50<15:00, 1.66it/s] 5%|▍ | 73/1563 [00:50<15:14, 1.63it/s] 5%|▍ | 74/1563 [00:51<16:11, 1.53it/s] 5%|▍ | 75/1563 [00:52<16:13, 1.53it/s] 5%|▍ | 76/1563 [00:53<16:27, 1.51it/s] 5%|▍ | 77/1563 [00:53<14:50, 1.67it/s] 5%|▍ | 78/1563 [00:54<15:09, 1.63it/s] 5%|▌ | 79/1563 [00:54<13:55, 1.78it/s] 5%|▌ | 80/1563 [00:55<15:26, 1.60it/s] 5%|▌ | 81/1563 [00:56<15:48, 1.56it/s] 5%|▌ | 82/1563 [00:56<16:44, 1.47it/s] 5%|▌ | 83/1563 [00:57<15:32, 1.59it/s] 5%|▌ | 84/1563 [00:57<14:18, 1.72it/s] 5%|▌ | 85/1563 [00:58<16:08, 
1.53it/s] 6%|▌ | 86/1563 [00:59<16:27, 1.50it/s] 6%|▌ | 87/1563 [00:59<14:37, 1.68it/s] 6%|▌ | 88/1563 [01:00<14:22, 1.71it/s] 6%|▌ | 89/1563 [01:01<16:11, 1.52it/s] 6%|▌ | 90/1563 [01:01<14:48, 1.66it/s] 6%|▌ | 91/1563 [01:02<16:03, 1.53it/s] 6%|▌ | 92/1563 [01:03<17:37, 1.39it/s] 6%|▌ | 93/1563 [01:03<16:19, 1.50it/s] 6%|▌ | 94/1563 [01:04<14:42, 1.66it/s] 6%|▌ | 95/1563 [01:05<17:03, 1.43it/s] 6%|▌ | 96/1563 [01:05<16:24, 1.49it/s] 6%|▌ | 97/1563 [01:06<16:37, 1.47it/s] 6%|▋ | 98/1563 [01:07<17:51, 1.37it/s] 6%|▋ | 99/1563 [01:07<16:22, 1.49it/s] 6%|▋ | 100/1563 [01:08<16:51, 1.45it/s] {'loss': 0.7259, 'grad_norm': 15.6875, 'learning_rate': 1.8733205374280233e-05, 'epoch': 0.06} + 6%|▋ | 100/1563 [01:08<16:51, 1.45it/s] 6%|▋ | 101/1563 [01:09<16:29, 1.48it/s] 7%|▋ | 102/1563 [01:09<16:37, 1.47it/s] 7%|▋ | 103/1563 [01:10<17:50, 1.36it/s] 7%|▋ | 104/1563 [01:11<16:26, 1.48it/s] 7%|▋ | 105/1563 [01:11<16:29, 1.47it/s] 7%|▋ | 106/1563 [01:12<14:43, 1.65it/s] 7%|▋ | 107/1563 [01:13<16:21, 1.48it/s] 7%|▋ | 108/1563 [01:13<15:55, 1.52it/s] 7%|▋ | 109/1563 [01:14<16:31, 1.47it/s] 7%|▋ | 110/1563 [01:15<16:16, 1.49it/s] 7%|▋ | 111/1563 [01:15<14:37, 1.65it/s] 7%|▋ | 112/1563 [01:16<15:08, 1.60it/s] 7%|▋ | 113/1563 [01:16<14:23, 1.68it/s] 7%|▋ | 114/1563 [01:17<14:49, 1.63it/s] 7%|▋ | 115/1563 [01:18<13:39, 1.77it/s] 7%|▋ | 116/1563 [01:18<15:36, 1.55it/s] 7%|▋ | 117/1563 [01:19<14:38, 1.65it/s] 8%|▊ | 118/1563 [01:19<14:31, 1.66it/s] 8%|▊ | 119/1563 [01:20<13:54, 1.73it/s] 8%|▊ | 120/1563 [01:20<12:50, 1.87it/s] 8%|▊ | 121/1563 [01:21<14:24, 1.67it/s] 8%|▊ | 122/1563 [01:22<13:48, 1.74it/s] 8%|▊ | 123/1563 [01:23<15:53, 1.51it/s] 8%|▊ | 124/1563 [01:23<14:52, 1.61it/s] 8%|▊ | 125/1563 [01:24<15:52, 1.51it/s] 8%|▊ | 126/1563 [01:25<17:03, 1.40it/s] 8%|▊ | 127/1563 [01:26<18:01, 1.33it/s] 8%|▊ | 128/1563 [01:26<18:37, 1.28it/s] 8%|▊ | 129/1563 [01:27<16:26, 1.45it/s] 8%|▊ | 130/1563 [01:28<16:22, 1.46it/s] 8%|▊ | 131/1563 [01:28<17:15, 1.38it/s] 8%|▊ | 132/1563 
[01:29<15:33, 1.53it/s] 9%|▊ | 133/1563 [01:29<14:53, 1.60it/s] 9%|▊ | 134/1563 [01:30<16:31, 1.44it/s] 9%|▊ | 135/1563 [01:31<16:27, 1.45it/s] 9%|▊ | 136/1563 [01:32<16:54, 1.41it/s] 9%|▉ | 137/1563 [01:32<15:07, 1.57it/s] 9%|▉ | 138/1563 [01:33<16:33, 1.43it/s] 9%|▉ | 139/1563 [01:33<14:51, 1.60it/s] 9%|▉ | 140/1563 [01:34<15:38, 1.52it/s] 9%|▉ | 141/1563 [01:35<15:49, 1.50it/s] 9%|▉ | 142/1563 [01:35<14:10, 1.67it/s] 9%|▉ | 143/1563 [01:36<14:24, 1.64it/s] 9%|▉ | 144/1563 [01:37<14:55, 1.58it/s] 9%|▉ | 145/1563 [01:37<13:48, 1.71it/s] 9%|▉ | 146/1563 [01:38<14:15, 1.66it/s] 9%|▉ | 147/1563 [01:39<15:53, 1.48it/s] 9%|▉ | 148/1563 [01:39<16:32, 1.43it/s] 10%|▉ | 149/1563 [01:40<16:38, 1.42it/s] 10%|▉ | 150/1563 [01:41<17:39, 1.33it/s] {'loss': 0.2407, 'grad_norm': 35.75, 'learning_rate': 1.8093410108765196e-05, 'epoch': 0.1} + 10%|▉ | 150/1563 [01:41<17:39, 1.33it/s] 10%|▉ | 151/1563 [01:41<15:54, 1.48it/s] 10%|▉ | 152/1563 [01:42<14:24, 1.63it/s] 10%|▉ | 153/1563 [01:43<16:06, 1.46it/s] 10%|▉ | 154/1563 [01:44<17:15, 1.36it/s] 10%|▉ | 155/1563 [01:44<18:07, 1.29it/s] 10%|▉ | 156/1563 [01:45<17:36, 1.33it/s] 10%|█ | 157/1563 [01:46<15:41, 1.49it/s] 10%|█ | 158/1563 [01:46<14:06, 1.66it/s] 10%|█ | 159/1563 [01:47<14:46, 1.58it/s] 10%|█ | 160/1563 [01:47<14:27, 1.62it/s] 10%|█ | 161/1563 [01:48<15:17, 1.53it/s] 10%|█ | 162/1563 [01:49<16:40, 1.40it/s] 10%|█ | 163/1563 [01:49<14:47, 1.58it/s] 10%|█ | 164/1563 [01:50<16:13, 1.44it/s] 11%|█ | 165/1563 [01:51<15:58, 1.46it/s] 11%|█ | 166/1563 [01:51<14:34, 1.60it/s] 11%|█ | 167/1563 [01:52<14:04, 1.65it/s] 11%|█ | 168/1563 [01:53<15:45, 1.47it/s] 11%|█ | 169/1563 [01:53<15:24, 1.51it/s] 11%|█ | 170/1563 [01:54<15:21, 1.51it/s] 11%|█ | 171/1563 [01:55<15:33, 1.49it/s] 11%|█ | 172/1563 [01:55<14:48, 1.57it/s] 11%|█ | 173/1563 [01:56<15:25, 1.50it/s] 11%|█ | 174/1563 [01:57<16:04, 1.44it/s] 11%|█ | 175/1563 [01:58<16:10, 1.43it/s] 11%|█▏ | 176/1563 [01:58<16:49, 1.37it/s] 11%|█▏ | 177/1563 [01:59<15:18, 1.51it/s] 11%|█▏ | 
178/1563 [01:59<15:08, 1.52it/s] 11%|█▏ | 179/1563 [02:00<14:00, 1.65it/s] 12%|█▏ | 180/1563 [02:01<15:37, 1.48it/s] 12%|█▏ | 181/1563 [02:02<16:46, 1.37it/s] 12%|█▏ | 182/1563 [02:02<15:15, 1.51it/s] 12%|█▏ | 183/1563 [02:03<14:02, 1.64it/s] 12%|█▏ | 184/1563 [02:03<14:31, 1.58it/s] 12%|█▏ | 185/1563 [02:04<15:28, 1.48it/s] 12%|█▏ | 186/1563 [02:05<14:19, 1.60it/s] 12%|█▏ | 187/1563 [02:05<15:10, 1.51it/s] 12%|█▏ | 188/1563 [02:06<16:38, 1.38it/s] 12%|█▏ | 189/1563 [02:07<17:33, 1.30it/s] 12%|█▏ | 190/1563 [02:08<17:05, 1.34it/s] 12%|█▏ | 191/1563 [02:08<16:31, 1.38it/s] 12%|█▏ | 192/1563 [02:09<15:33, 1.47it/s] 12%|█▏ | 193/1563 [02:10<16:42, 1.37it/s] 12%|█▏ | 194/1563 [02:10<14:28, 1.58it/s] 12%|█▏ | 195/1563 [02:11<14:44, 1.55it/s] 13%|█▎ | 196/1563 [02:12<15:35, 1.46it/s] 13%|█▎ | 197/1563 [02:13<16:33, 1.38it/s] 13%|█▎ | 198/1563 [02:13<16:19, 1.39it/s] 13%|█▎ | 199/1563 [02:14<15:22, 1.48it/s] 13%|█▎ | 200/1563 [02:15<15:31, 1.46it/s] {'loss': 0.2004, 'grad_norm': 15.75, 'learning_rate': 1.7453614843250163e-05, 'epoch': 0.13} + 13%|█▎ | 200/1563 [02:15<15:31, 1.46it/s] 13%|█▎ | 201/1563 [02:15<15:57, 1.42it/s] 13%|█▎ | 202/1563 [02:16<13:55, 1.63it/s] 13%|█▎ | 203/1563 [02:17<15:24, 1.47it/s] 13%|█▎ | 204/1563 [02:17<15:31, 1.46it/s] 13%|█▎ | 205/1563 [02:18<14:52, 1.52it/s] 13%|█▎ | 206/1563 [02:19<16:07, 1.40it/s] 13%|█▎ | 207/1563 [02:19<15:53, 1.42it/s] 13%|█▎ | 208/1563 [02:20<14:59, 1.51it/s] 13%|█▎ | 209/1563 [02:21<16:04, 1.40it/s] 13%|█▎ | 210/1563 [02:22<16:47, 1.34it/s] 13%|█▎ | 211/1563 [02:22<15:35, 1.44it/s] 14%|█▎ | 212/1563 [02:23<14:01, 1.60it/s] 14%|█▎ | 213/1563 [02:23<12:59, 1.73it/s] 14%|█▎ | 214/1563 [02:24<14:01, 1.60it/s] 14%|█▍ | 215/1563 [02:25<19:18, 1.16it/s] 14%|█▍ | 216/1563 [02:26<16:14, 1.38it/s] 14%|█▍ | 217/1563 [02:26<14:18, 1.57it/s] 14%|█▍ | 218/1563 [02:26<12:54, 1.74it/s] 14%|█▍ | 219/1563 [02:27<13:47, 1.62it/s] 14%|█▍ | 220/1563 [02:28<16:03, 1.39it/s] 14%|█▍ | 221/1563 [02:29<14:43, 1.52it/s] 14%|█▍ | 222/1563 
[02:30<16:01, 1.39it/s] 14%|█▍ | 223/1563 [02:30<14:35, 1.53it/s] 14%|█▍ | 224/1563 [02:31<15:52, 1.41it/s] 14%|█▍ | 225/1563 [02:32<15:35, 1.43it/s] 14%|█▍ | 226/1563 [02:32<14:46, 1.51it/s] 15%|█▍ | 227/1563 [02:33<14:25, 1.54it/s] 15%|█▍ | 228/1563 [02:33<14:36, 1.52it/s] 15%|█▍ | 229/1563 [02:34<15:21, 1.45it/s] 15%|█▍ | 230/1563 [02:35<15:52, 1.40it/s] 15%|█▍ | 231/1563 [02:36<16:19, 1.36it/s] 15%|█▍ | 232/1563 [02:36<15:32, 1.43it/s] 15%|█▍ | 233/1563 [02:37<16:35, 1.34it/s] 15%|█▍ | 234/1563 [02:38<16:09, 1.37it/s] 15%|█▌ | 235/1563 [02:39<16:59, 1.30it/s] 15%|█▌ | 236/1563 [02:39<16:18, 1.36it/s] 15%|█▌ | 237/1563 [02:40<16:31, 1.34it/s] 15%|█▌ | 238/1563 [02:41<14:31, 1.52it/s] 15%|█▌ | 239/1563 [02:41<15:41, 1.41it/s] 15%|█▌ | 240/1563 [02:42<15:36, 1.41it/s] 15%|█▌ | 241/1563 [02:43<14:45, 1.49it/s] 15%|█▌ | 242/1563 [02:43<13:03, 1.69it/s] 16%|█▌ | 243/1563 [02:44<14:04, 1.56it/s] 16%|█▌ | 244/1563 [02:45<14:25, 1.52it/s] 16%|█▌ | 245/1563 [02:45<13:26, 1.63it/s] 16%|█▌ | 246/1563 [02:46<15:05, 1.45it/s] 16%|█▌ | 247/1563 [02:47<15:54, 1.38it/s] 16%|█▌ | 248/1563 [02:48<15:49, 1.38it/s] 16%|█▌ | 249/1563 [02:48<13:53, 1.58it/s] 16%|█▌ | 250/1563 [02:49<13:57, 1.57it/s] {'loss': 0.1715, 'grad_norm': 46.75, 'learning_rate': 1.6813819577735126e-05, 'epoch': 0.16} + 16%|█▌ | 250/1563 [02:49<13:57, 1.57it/s] 16%|█▌ | 251/1563 [02:49<12:42, 1.72it/s] 16%|█▌ | 252/1563 [02:50<12:43, 1.72it/s] 16%|█▌ | 253/1563 [02:50<14:31, 1.50it/s] 16%|█▋ | 254/1563 [02:51<13:25, 1.63it/s] 16%|█▋ | 255/1563 [02:52<13:25, 1.62it/s] 16%|█▋ | 256/1563 [02:52<13:46, 1.58it/s] 16%|█▋ | 257/1563 [02:53<15:01, 1.45it/s] 17%|█▋ | 258/1563 [02:54<13:35, 1.60it/s] 17%|█▋ | 259/1563 [02:54<12:30, 1.74it/s] 17%|█▋ | 260/1563 [02:55<12:13, 1.78it/s] 17%|█▋ | 261/1563 [02:55<13:58, 1.55it/s] 17%|█▋ | 262/1563 [02:56<12:48, 1.69it/s] 17%|█▋ | 263/1563 [02:56<11:57, 1.81it/s] 17%|█▋ | 264/1563 [02:57<11:28, 1.89it/s] 17%|█▋ | 265/1563 [02:57<11:35, 1.87it/s] 17%|█▋ | 266/1563 [02:58<12:33, 
1.72it/s] 17%|█▋ | 267/1563 [02:59<13:18, 1.62it/s] 17%|█▋ | 268/1563 [02:59<13:27, 1.60it/s] 17%|█▋ | 269/1563 [03:00<14:45, 1.46it/s] 17%|█▋ | 270/1563 [03:01<13:20, 1.62it/s] 17%|█▋ | 271/1563 [03:01<14:19, 1.50it/s] 17%|█▋ | 272/1563 [03:02<14:26, 1.49it/s] 17%|█▋ | 273/1563 [03:03<14:37, 1.47it/s] 18%|█▊ | 274/1563 [03:03<13:12, 1.63it/s] 18%|█▊ | 275/1563 [03:04<13:28, 1.59it/s] 18%|█▊ | 276/1563 [03:04<12:40, 1.69it/s] 18%|█▊ | 277/1563 [03:05<13:59, 1.53it/s] 18%|█▊ | 278/1563 [03:06<15:20, 1.40it/s] 18%|█▊ | 279/1563 [03:07<15:50, 1.35it/s] 18%|█▊ | 280/1563 [03:08<14:48, 1.44it/s] 18%|█▊ | 281/1563 [03:08<15:05, 1.42it/s] 18%|█▊ | 282/1563 [03:09<14:52, 1.44it/s] 18%|█▊ | 283/1563 [03:10<14:43, 1.45it/s] 18%|█▊ | 284/1563 [03:10<14:45, 1.44it/s] 18%|█▊ | 285/1563 [03:11<15:20, 1.39it/s] 18%|█▊ | 286/1563 [03:12<13:54, 1.53it/s] 18%|█▊ | 287/1563 [03:12<15:09, 1.40it/s] 18%|█▊ | 288/1563 [03:13<15:47, 1.35it/s] 18%|█▊ | 289/1563 [03:14<16:30, 1.29it/s] 19%|█▊ | 290/1563 [03:15<14:38, 1.45it/s] 19%|█▊ | 291/1563 [03:15<14:59, 1.41it/s] 19%|█▊ | 292/1563 [03:16<15:28, 1.37it/s] 19%|█▊ | 293/1563 [03:17<16:18, 1.30it/s] 19%|█▉ | 294/1563 [03:18<16:46, 1.26it/s] 19%|█▉ | 295/1563 [03:19<17:08, 1.23it/s] 19%|█▉ | 296/1563 [03:19<16:56, 1.25it/s] 19%|█▉ | 297/1563 [03:20<17:15, 1.22it/s] 19%|█▉ | 298/1563 [03:21<16:55, 1.25it/s] 19%|█▉ | 299/1563 [03:22<16:52, 1.25it/s] 19%|█▉ | 300/1563 [03:23<16:56, 1.24it/s] {'loss': 0.1797, 'grad_norm': 426.0, 'learning_rate': 1.6174024312220092e-05, 'epoch': 0.19} + 19%|█▉ | 300/1563 [03:23<16:56, 1.24it/s] 19%|█▉ | 301/1563 [03:23<16:20, 1.29it/s] 19%|█▉ | 302/1563 [03:24<14:23, 1.46it/s] 19%|█▉ | 303/1563 [03:24<13:27, 1.56it/s] 19%|█▉ | 304/1563 [03:25<14:08, 1.48it/s] 20%|█▉ | 305/1563 [03:26<14:31, 1.44it/s] 20%|█▉ | 306/1563 [03:26<13:54, 1.51it/s] 20%|█▉ | 307/1563 [03:27<14:57, 1.40it/s] 20%|█▉ | 308/1563 [03:28<13:29, 1.55it/s] 20%|█▉ | 309/1563 [03:29<14:32, 1.44it/s] 20%|█▉ | 310/1563 [03:29<13:09, 1.59it/s] 
20%|█▉ | 311/1563 [03:30<12:50, 1.63it/s] 20%|█▉ | 312/1563 [03:31<14:15, 1.46it/s] 20%|██ | 313/1563 [03:31<13:35, 1.53it/s] 20%|██ | 314/1563 [03:32<14:42, 1.42it/s] 20%|██ | 315/1563 [03:33<13:56, 1.49it/s] 20%|██ | 316/1563 [03:33<12:51, 1.62it/s] 20%|██ | 317/1563 [03:34<13:17, 1.56it/s] 20%|██ | 318/1563 [03:34<12:32, 1.65it/s] 20%|██ | 319/1563 [03:35<12:54, 1.61it/s] 20%|██ | 320/1563 [03:36<14:20, 1.45it/s] 21%|██ | 321/1563 [03:36<14:23, 1.44it/s] 21%|██ | 322/1563 [03:37<15:01, 1.38it/s] 21%|██ | 323/1563 [03:38<14:33, 1.42it/s] 21%|██ | 324/1563 [03:39<14:38, 1.41it/s] 21%|██ | 325/1563 [03:39<14:12, 1.45it/s] 21%|██ | 326/1563 [03:40<14:42, 1.40it/s] 21%|██ | 327/1563 [03:41<15:25, 1.34it/s] 21%|██ | 328/1563 [03:41<13:30, 1.52it/s] 21%|██ | 329/1563 [03:42<14:33, 1.41it/s] 21%|██ | 330/1563 [03:43<13:58, 1.47it/s] 21%|██ | 331/1563 [03:43<12:31, 1.64it/s] 21%|██ | 332/1563 [03:44<11:24, 1.80it/s] 21%|██▏ | 333/1563 [03:44<12:48, 1.60it/s] 21%|██▏ | 334/1563 [03:45<14:10, 1.44it/s] 21%|██▏ | 335/1563 [03:46<12:57, 1.58it/s] 21%|██▏ | 336/1563 [03:47<14:23, 1.42it/s] 22%|██▏ | 337/1563 [03:47<13:23, 1.53it/s] 22%|██▏ | 338/1563 [03:48<14:32, 1.40it/s] 22%|██▏ | 339/1563 [03:49<14:30, 1.41it/s] 22%|██▏ | 340/1563 [03:49<13:30, 1.51it/s] 22%|██▏ | 341/1563 [03:50<14:45, 1.38it/s] 22%|██▏ | 342/1563 [03:51<15:33, 1.31it/s] 22%|██▏ | 343/1563 [03:52<15:00, 1.35it/s] 22%|██▏ | 344/1563 [03:52<13:22, 1.52it/s] 22%|██▏ | 345/1563 [03:53<14:24, 1.41it/s] 22%|██▏ | 346/1563 [03:54<14:29, 1.40it/s] 22%|██▏ | 347/1563 [03:55<15:21, 1.32it/s] 22%|██▏ | 348/1563 [03:55<15:05, 1.34it/s] 22%|██▏ | 349/1563 [03:56<15:40, 1.29it/s] 22%|██▏ | 350/1563 [03:57<15:11, 1.33it/s] {'loss': 0.1763, 'grad_norm': 15.5, 'learning_rate': 1.5534229046705055e-05, 'epoch': 0.22} + 22%|██▏ | 350/1563 [03:57<15:11, 1.33it/s] 22%|██▏ | 351/1563 [03:58<15:22, 1.31it/s] 23%|██▎ | 352/1563 [03:58<15:50, 1.27it/s] 23%|██▎ | 353/1563 [03:59<16:13, 1.24it/s] 23%|██▎ | 354/1563 [04:00<16:26, 
1.23it/s] 23%|██▎ | 355/1563 [04:01<16:10, 1.24it/s] 23%|██▎ | 356/1563 [04:02<15:51, 1.27it/s] 23%|██▎ | 357/1563 [04:02<14:15, 1.41it/s] 23%|██▎ | 358/1563 [04:03<15:13, 1.32it/s] 23%|██▎ | 359/1563 [04:04<14:31, 1.38it/s] 23%|██▎ | 360/1563 [04:05<15:00, 1.34it/s] 23%|██▎ | 361/1563 [04:05<15:14, 1.31it/s] 23%|██▎ | 362/1563 [04:06<15:42, 1.27it/s] 23%|██▎ | 363/1563 [04:07<14:53, 1.34it/s] 23%|██▎ | 364/1563 [04:07<13:19, 1.50it/s] 23%|██▎ | 365/1563 [04:08<12:23, 1.61it/s] 23%|██▎ | 366/1563 [04:09<13:18, 1.50it/s] 23%|██▎ | 367/1563 [04:09<13:40, 1.46it/s] 24%|██▎ | 368/1563 [04:10<14:45, 1.35it/s] 24%|██▎ | 369/1563 [04:11<13:10, 1.51it/s] 24%|██▎ | 370/1563 [04:11<11:57, 1.66it/s] 24%|██▎ | 371/1563 [04:12<10:57, 1.81it/s] 24%|██▍ | 372/1563 [04:12<12:48, 1.55it/s] 24%|██▍ | 373/1563 [04:13<11:45, 1.69it/s] 24%|██▍ | 374/1563 [04:14<13:16, 1.49it/s] 24%|██▍ | 375/1563 [04:15<14:10, 1.40it/s] 24%|██▍ | 376/1563 [04:15<12:36, 1.57it/s] 24%|██▍ | 377/1563 [04:15<11:27, 1.73it/s] 24%|██▍ | 378/1563 [04:16<10:35, 1.87it/s] 24%|██▍ | 379/1563 [04:17<11:48, 1.67it/s] 24%|██▍ | 380/1563 [04:17<12:16, 1.61it/s] 24%|██▍ | 381/1563 [04:18<13:34, 1.45it/s] 24%|██▍ | 382/1563 [04:19<14:14, 1.38it/s] 25%|██▍ | 383/1563 [04:20<14:51, 1.32it/s] 25%|██▍ | 384/1563 [04:20<13:13, 1.49it/s] 25%|██▍ | 385/1563 [04:21<13:22, 1.47it/s] 25%|██▍ | 386/1563 [04:22<12:39, 1.55it/s] 25%|██▍ | 387/1563 [04:22<12:14, 1.60it/s] 25%|██▍ | 388/1563 [04:23<13:11, 1.48it/s] 25%|██▍ | 389/1563 [04:23<12:14, 1.60it/s] 25%|██▍ | 390/1563 [04:24<12:42, 1.54it/s] 25%|██▌ | 391/1563 [04:25<11:47, 1.66it/s] 25%|██▌ | 392/1563 [04:25<13:14, 1.47it/s] 25%|██▌ | 393/1563 [04:26<14:07, 1.38it/s] 25%|██▌ | 394/1563 [04:27<14:41, 1.33it/s] 25%|██▌ | 395/1563 [04:28<13:05, 1.49it/s] 25%|██▌ | 396/1563 [04:28<13:10, 1.48it/s] 25%|██▌ | 397/1563 [04:29<12:32, 1.55it/s] 25%|██▌ | 398/1563 [04:29<11:35, 1.67it/s] 26%|██▌ | 399/1563 [04:30<13:00, 1.49it/s] 26%|██▌ | 400/1563 [04:31<12:21, 1.57it/s] {'loss': 
0.161, 'grad_norm': 11.1875, 'learning_rate': 1.4894433781190021e-05, 'epoch': 0.26} + 26%|██▌ | 400/1563 [04:31<12:21, 1.57it/s] 26%|██▌ | 401/1563 [04:32<13:32, 1.43it/s] 26%|██▌ | 402/1563 [04:32<12:17, 1.57it/s] 26%|██▌ | 403/1563 [04:33<12:32, 1.54it/s] 26%|██▌ | 404/1563 [04:33<11:44, 1.65it/s] 26%|██▌ | 405/1563 [04:34<12:05, 1.60it/s] 26%|██▌ | 406/1563 [04:35<13:24, 1.44it/s] 26%|██▌ | 407/1563 [04:36<14:06, 1.37it/s] 26%|██▌ | 408/1563 [04:36<14:51, 1.30it/s] 26%|██▌ | 409/1563 [04:37<13:27, 1.43it/s] 26%|██▌ | 410/1563 [04:38<14:16, 1.35it/s] 26%|██▋ | 411/1563 [04:38<12:56, 1.48it/s] 26%|██▋ | 412/1563 [04:39<11:49, 1.62it/s] 26%|██▋ | 413/1563 [04:40<12:49, 1.50it/s] 26%|██▋ | 414/1563 [04:40<11:28, 1.67it/s] 27%|██▋ | 415/1563 [04:41<11:20, 1.69it/s] 27%|██▋ | 416/1563 [04:41<10:38, 1.80it/s] 27%|██▋ | 417/1563 [04:42<12:18, 1.55it/s] 27%|██▋ | 418/1563 [04:43<13:24, 1.42it/s] 27%|██▋ | 419/1563 [04:44<14:15, 1.34it/s] 27%|██▋ | 420/1563 [04:44<14:23, 1.32it/s] 27%|██▋ | 421/1563 [04:45<13:54, 1.37it/s] 27%|██▋ | 422/1563 [04:46<14:35, 1.30it/s] 27%|██▋ | 423/1563 [04:46<12:49, 1.48it/s] 27%|██▋ | 424/1563 [04:47<11:53, 1.60it/s] 27%|██▋ | 425/1563 [04:48<13:07, 1.44it/s] 27%|██▋ | 426/1563 [04:49<14:02, 1.35it/s] 27%|██▋ | 427/1563 [04:49<12:18, 1.54it/s] 27%|██▋ | 428/1563 [04:50<13:15, 1.43it/s] 27%|██▋ | 429/1563 [04:51<13:03, 1.45it/s] 28%|██▊ | 430/1563 [04:51<13:23, 1.41it/s] 28%|██▊ | 431/1563 [04:52<12:00, 1.57it/s] 28%|██▊ | 432/1563 [04:52<10:46, 1.75it/s] 28%|██▊ | 433/1563 [04:53<10:25, 1.81it/s] 28%|██▊ | 434/1563 [04:53<10:12, 1.84it/s] 28%|██▊ | 435/1563 [04:54<10:20, 1.82it/s] 28%|██▊ | 436/1563 [04:55<11:52, 1.58it/s] 28%|██▊ | 437/1563 [04:55<11:02, 1.70it/s] 28%|██▊ | 438/1563 [04:56<11:45, 1.60it/s] 28%|██▊ | 439/1563 [04:56<11:15, 1.66it/s] 28%|██▊ | 440/1563 [04:57<10:50, 1.73it/s] 28%|██▊ | 441/1563 [04:57<10:17, 1.82it/s] 28%|██▊ | 442/1563 [04:58<09:41, 1.93it/s] 28%|██▊ | 443/1563 [04:58<09:28, 1.97it/s] 28%|██▊ | 444/1563 
[04:59<09:49, 1.90it/s] 28%|██▊ | 445/1563 [05:00<11:30, 1.62it/s] 29%|██▊ | 446/1563 [05:00<12:09, 1.53it/s] 29%|██▊ | 447/1563 [05:01<12:46, 1.46it/s] 29%|██▊ | 448/1563 [05:02<12:56, 1.44it/s] 29%|██▊ | 449/1563 [05:02<12:07, 1.53it/s] 29%|██▉ | 450/1563 [05:03<12:29, 1.48it/s] {'loss': 0.1574, 'grad_norm': 10.0, 'learning_rate': 1.4254638515674986e-05, 'epoch': 0.29} + 29%|██▉ | 450/1563 [05:03<12:29, 1.48it/s] 29%|██▉ | 451/1563 [05:04<11:21, 1.63it/s] 29%|██▉ | 452/1563 [05:04<11:50, 1.56it/s] 29%|██▉ | 453/1563 [05:05<10:48, 1.71it/s] 29%|██▉ | 454/1563 [05:05<11:01, 1.68it/s] 29%|██▉ | 455/1563 [05:06<10:35, 1.74it/s] 29%|██▉ | 456/1563 [05:06<10:06, 1.83it/s] 29%|██▉ | 457/1563 [05:07<09:33, 1.93it/s] 29%|██▉ | 458/1563 [05:07<09:13, 2.00it/s] 29%|██▉ | 459/1563 [05:08<11:04, 1.66it/s] 29%|██▉ | 460/1563 [05:09<12:27, 1.48it/s] 29%|██▉ | 461/1563 [05:09<11:10, 1.64it/s] 30%|██▉ | 462/1563 [05:10<12:31, 1.46it/s] 30%|██▉ | 463/1563 [05:11<11:06, 1.65it/s] 30%|██▉ | 464/1563 [05:11<10:13, 1.79it/s] 30%|██▉ | 465/1563 [05:12<11:23, 1.61it/s] 30%|██▉ | 466/1563 [05:13<12:42, 1.44it/s] 30%|██▉ | 467/1563 [05:13<11:45, 1.55it/s] 30%|██▉ | 468/1563 [05:14<10:42, 1.70it/s] 30%|███ | 469/1563 [05:15<12:05, 1.51it/s] 30%|███ | 470/1563 [05:15<11:09, 1.63it/s] 30%|███ | 471/1563 [05:16<10:37, 1.71it/s] 30%|███ | 472/1563 [05:16<09:57, 1.83it/s] 30%|███ | 473/1563 [05:17<11:34, 1.57it/s] 30%|███ | 474/1563 [05:17<10:22, 1.75it/s] 30%|███ | 475/1563 [05:18<10:05, 1.80it/s] 30%|███ | 476/1563 [05:18<09:17, 1.95it/s] 31%|███ | 477/1563 [05:19<09:16, 1.95it/s] 31%|███ | 478/1563 [05:19<08:50, 2.05it/s] 31%|███ | 479/1563 [05:20<09:21, 1.93it/s] 31%|███ | 480/1563 [05:21<11:09, 1.62it/s] 31%|███ | 481/1563 [05:21<11:53, 1.52it/s] 31%|███ | 482/1563 [05:22<10:46, 1.67it/s] 31%|███ | 483/1563 [05:23<11:17, 1.59it/s] 31%|███ | 484/1563 [05:23<10:59, 1.64it/s] 31%|███ | 485/1563 [05:24<10:42, 1.68it/s] 31%|███ | 486/1563 [05:24<09:43, 1.84it/s] 31%|███ | 487/1563 [05:25<11:05, 
1.62it/s] 31%|███ | 488/1563 [05:26<11:55, 1.50it/s] 31%|███▏ | 489/1563 [05:27<12:38, 1.42it/s] 31%|███▏ | 490/1563 [05:27<13:14, 1.35it/s] 31%|███▏ | 491/1563 [05:28<11:58, 1.49it/s] 31%|███▏ | 492/1563 [05:28<11:14, 1.59it/s] 32%|███▏ | 493/1563 [05:29<10:24, 1.71it/s] 32%|███▏ | 494/1563 [05:30<11:29, 1.55it/s] 32%|███▏ | 495/1563 [05:30<10:40, 1.67it/s] 32%|███▏ | 496/1563 [05:31<11:57, 1.49it/s] 32%|███▏ | 497/1563 [05:31<10:45, 1.65it/s] 32%|███▏ | 498/1563 [05:32<10:50, 1.64it/s] 32%|███▏ | 499/1563 [05:33<11:16, 1.57it/s] 32%|███▏ | 500/1563 [05:34<12:29, 1.42it/s] {'loss': 0.15, 'grad_norm': 14.0625, 'learning_rate': 1.361484325015995e-05, 'epoch': 0.32} + 32%|███▏ | 500/1563 [05:34<12:29, 1.42it/s] 32%|███▏ | 501/1563 [05:34<12:59, 1.36it/s] 32%|███▏ | 502/1563 [05:35<11:37, 1.52it/s] 32%|███▏ | 503/1563 [05:35<10:59, 1.61it/s] 32%|███▏ | 504/1563 [05:36<11:44, 1.50it/s] 32%|███▏ | 505/1563 [05:37<12:49, 1.38it/s] 32%|███▏ | 506/1563 [05:38<13:18, 1.32it/s] 32%|███▏ | 507/1563 [05:39<13:42, 1.28it/s] 33%|███▎ | 508/1563 [05:40<13:45, 1.28it/s] 33%|███▎ | 509/1563 [05:40<14:00, 1.25it/s] 33%|███▎ | 510/1563 [05:41<12:55, 1.36it/s] 33%|███▎ | 511/1563 [05:42<11:46, 1.49it/s] 33%|███▎ | 512/1563 [05:42<10:55, 1.60it/s] 33%|███▎ | 513/1563 [05:43<10:57, 1.60it/s] 33%|███▎ | 514/1563 [05:43<10:52, 1.61it/s] 33%|███▎ | 515/1563 [05:44<10:08, 1.72it/s] 33%|███▎ | 516/1563 [05:44<09:46, 1.79it/s] 33%|███▎ | 517/1563 [05:45<10:17, 1.70it/s] 33%|███▎ | 518/1563 [05:45<10:06, 1.72it/s] 33%|███▎ | 519/1563 [05:46<10:19, 1.69it/s] 33%|███▎ | 520/1563 [05:47<10:16, 1.69it/s] 33%|███▎ | 521/1563 [05:47<11:05, 1.56it/s] 33%|███▎ | 522/1563 [05:48<11:32, 1.50it/s] 33%|███▎ | 523/1563 [05:49<12:23, 1.40it/s] 34%|███▎ | 524/1563 [05:50<12:38, 1.37it/s] 34%|███▎ | 525/1563 [05:50<11:23, 1.52it/s] 34%|███▎ | 526/1563 [05:51<10:31, 1.64it/s] 34%|███▎ | 527/1563 [05:51<09:47, 1.76it/s] 34%|███▍ | 528/1563 [05:52<09:32, 1.81it/s] 34%|███▍ | 529/1563 [05:53<10:46, 1.60it/s] 
34%|███▍ | 530/1563 [05:53<11:03, 1.56it/s] 34%|███▍ | 531/1563 [05:54<11:41, 1.47it/s] 34%|███▍ | 532/1563 [05:55<12:23, 1.39it/s] 34%|███▍ | 533/1563 [05:55<11:39, 1.47it/s] 34%|███▍ | 534/1563 [05:56<10:35, 1.62it/s] 34%|███▍ | 535/1563 [05:57<11:36, 1.48it/s] 34%|███▍ | 536/1563 [05:57<10:32, 1.62it/s] 34%|███▍ | 537/1563 [05:58<11:41, 1.46it/s] 34%|███▍ | 538/1563 [05:58<10:38, 1.61it/s] 34%|███▍ | 539/1563 [05:59<10:42, 1.59it/s] 35%|███▍ | 540/1563 [06:00<11:06, 1.54it/s] 35%|███▍ | 541/1563 [06:00<10:50, 1.57it/s] 35%|███▍ | 542/1563 [06:01<10:15, 1.66it/s] 35%|███▍ | 543/1563 [06:02<11:19, 1.50it/s] 35%|███▍ | 544/1563 [06:03<11:52, 1.43it/s] 35%|███▍ | 545/1563 [06:03<12:37, 1.34it/s] 35%|███▍ | 546/1563 [06:04<13:11, 1.28it/s] 35%|███▍ | 547/1563 [06:05<12:41, 1.33it/s] 35%|███▌ | 548/1563 [06:05<11:05, 1.52it/s] 35%|███▌ | 549/1563 [06:06<11:03, 1.53it/s] 35%|███▌ | 550/1563 [06:06<09:48, 1.72it/s] {'loss': 0.1867, 'grad_norm': 13.75, 'learning_rate': 1.2975047984644915e-05, 'epoch': 0.35} + 35%|███▌ | 550/1563 [06:06<09:48, 1.72it/s] 35%|███▌ | 551/1563 [06:07<11:12, 1.50it/s] 35%|███▌ | 552/1563 [06:08<10:50, 1.55it/s] 35%|███▌ | 553/1563 [06:09<10:59, 1.53it/s] 35%|███▌ | 554/1563 [06:09<11:31, 1.46it/s] 36%|███▌ | 555/1563 [06:10<11:40, 1.44it/s] 36%|███▌ | 556/1563 [06:10<10:33, 1.59it/s] 36%|███▌ | 557/1563 [06:11<10:28, 1.60it/s] 36%|███▌ | 558/1563 [06:12<11:32, 1.45it/s] 36%|███▌ | 559/1563 [06:13<11:37, 1.44it/s] 36%|███▌ | 560/1563 [06:14<12:20, 1.35it/s] 36%|███▌ | 561/1563 [06:14<12:42, 1.31it/s] 36%|███▌ | 562/1563 [06:15<11:38, 1.43it/s] 36%|███▌ | 563/1563 [06:16<11:40, 1.43it/s] 36%|███▌ | 564/1563 [06:16<12:18, 1.35it/s] 36%|███▌ | 565/1563 [06:17<11:10, 1.49it/s] 36%|███▌ | 566/1563 [06:18<12:09, 1.37it/s] 36%|███▋ | 567/1563 [06:19<12:52, 1.29it/s] 36%|███▋ | 568/1563 [06:19<11:18, 1.47it/s] 36%|███▋ | 569/1563 [06:20<11:38, 1.42it/s] 36%|███▋ | 570/1563 [06:20<10:33, 1.57it/s] 37%|███▋ | 571/1563 [06:21<11:05, 1.49it/s] 37%|███▋ | 
572/1563 [06:22<11:43, 1.41it/s] 37%|███▋ | 573/1563 [06:22<10:48, 1.53it/s] 37%|███▋ | 574/1563 [06:23<11:46, 1.40it/s] 37%|███▋ | 575/1563 [06:24<11:43, 1.40it/s] 37%|███▋ | 576/1563 [06:25<11:26, 1.44it/s] 37%|███▋ | 577/1563 [06:25<10:45, 1.53it/s] 37%|███▋ | 578/1563 [06:26<10:57, 1.50it/s] 37%|███▋ | 579/1563 [06:27<11:52, 1.38it/s] 37%|███▋ | 580/1563 [06:28<12:27, 1.31it/s] 37%|███▋ | 581/1563 [06:28<11:47, 1.39it/s] 37%|███▋ | 582/1563 [06:29<12:19, 1.33it/s] 37%|███▋ | 583/1563 [06:30<11:35, 1.41it/s] 37%|███▋ | 584/1563 [06:30<10:25, 1.56it/s] 37%|███▋ | 585/1563 [06:31<11:24, 1.43it/s] 37%|███▋ | 586/1563 [06:32<11:40, 1.39it/s] 38%|███▊ | 587/1563 [06:32<10:20, 1.57it/s] 38%|███▊ | 588/1563 [06:33<10:40, 1.52it/s] 38%|███▊ | 589/1563 [06:34<10:56, 1.48it/s] 38%|███▊ | 590/1563 [06:34<10:28, 1.55it/s] 38%|███▊ | 591/1563 [06:35<09:35, 1.69it/s] 38%|███▊ | 592/1563 [06:35<10:33, 1.53it/s] 38%|███▊ | 593/1563 [06:36<11:26, 1.41it/s] 38%|███▊ | 594/1563 [06:37<11:16, 1.43it/s] 38%|███▊ | 595/1563 [06:37<10:10, 1.59it/s] 38%|███▊ | 596/1563 [06:38<11:18, 1.43it/s] 38%|███▊ | 597/1563 [06:39<11:19, 1.42it/s] 38%|███▊ | 598/1563 [06:40<11:15, 1.43it/s] 38%|███▊ | 599/1563 [06:40<10:47, 1.49it/s] 38%|███▊ | 600/1563 [06:41<10:46, 1.49it/s] {'loss': 0.1441, 'grad_norm': 12.9375, 'learning_rate': 1.233525271912988e-05, 'epoch': 0.38} + 38%|███▊ | 600/1563 [06:41<10:46, 1.49it/s] 38%|███▊ | 601/1563 [06:42<10:20, 1.55it/s] 39%|███▊ | 602/1563 [06:42<10:39, 1.50it/s] 39%|███▊ | 603/1563 [06:43<11:30, 1.39it/s] 39%|███▊ | 604/1563 [06:44<10:25, 1.53it/s] 39%|███▊ | 605/1563 [06:44<09:32, 1.67it/s] 39%|███▉ | 606/1563 [06:45<09:12, 1.73it/s] 39%|███▉ | 607/1563 [06:45<10:15, 1.55it/s] 39%|███▉ | 608/1563 [06:46<09:38, 1.65it/s] 39%|███▉ | 609/1563 [06:46<08:56, 1.78it/s] 39%|███▉ | 610/1563 [06:47<08:35, 1.85it/s] 39%|███▉ | 611/1563 [06:47<08:37, 1.84it/s] 39%|███▉ | 612/1563 [06:48<08:20, 1.90it/s] 39%|███▉ | 613/1563 [06:49<08:40, 1.83it/s] 39%|███▉ | 614/1563 
[06:49<09:41, 1.63it/s] 39%|███▉ | 615/1563 [06:50<09:58, 1.58it/s] 39%|███▉ | 616/1563 [06:51<10:42, 1.47it/s] 39%|███▉ | 617/1563 [06:51<09:47, 1.61it/s] 40%|███▉ | 618/1563 [06:52<10:25, 1.51it/s] 40%|███▉ | 619/1563 [06:53<10:27, 1.51it/s] 40%|███▉ | 620/1563 [06:53<09:28, 1.66it/s] 40%|███▉ | 621/1563 [06:54<09:50, 1.59it/s] 40%|███▉ | 622/1563 [06:55<11:01, 1.42it/s] 40%|███▉ | 623/1563 [06:55<09:59, 1.57it/s] 40%|███▉ | 624/1563 [06:56<09:58, 1.57it/s] 40%|███▉ | 625/1563 [06:57<10:15, 1.53it/s] 40%|████ | 626/1563 [06:57<10:34, 1.48it/s] 40%|████ | 627/1563 [06:58<10:25, 1.50it/s] 40%|████ | 628/1563 [06:59<11:19, 1.38it/s] 40%|████ | 629/1563 [06:59<10:25, 1.49it/s] 40%|████ | 630/1563 [07:00<11:17, 1.38it/s] 40%|████ | 631/1563 [07:01<10:28, 1.48it/s] 40%|████ | 632/1563 [07:01<09:21, 1.66it/s] 40%|████ | 633/1563 [07:02<10:06, 1.53it/s] 41%|████ | 634/1563 [07:02<09:23, 1.65it/s] 41%|████ | 635/1563 [07:03<09:06, 1.70it/s] 41%|████ | 636/1563 [07:03<08:36, 1.79it/s] 41%|████ | 637/1563 [07:04<10:00, 1.54it/s] 41%|████ | 638/1563 [07:05<10:37, 1.45it/s] 41%|████ | 639/1563 [07:06<09:51, 1.56it/s] 41%|████ | 640/1563 [07:06<09:47, 1.57it/s] 41%|████ | 641/1563 [07:07<09:04, 1.69it/s] 41%|████ | 642/1563 [07:07<09:35, 1.60it/s] 41%|████ | 643/1563 [07:08<09:14, 1.66it/s] 41%|████ | 644/1563 [07:09<09:42, 1.58it/s] 41%|████▏ | 645/1563 [07:10<10:44, 1.42it/s] 41%|████▏ | 646/1563 [07:10<09:49, 1.56it/s] 41%|████▏ | 647/1563 [07:11<09:37, 1.59it/s] 41%|████▏ | 648/1563 [07:12<10:40, 1.43it/s] 42%|████▏ | 649/1563 [07:12<11:21, 1.34it/s] 42%|████▏ | 650/1563 [07:13<10:52, 1.40it/s] {'loss': 0.1872, 'grad_norm': 23.0, 'learning_rate': 1.1695457453614845e-05, 'epoch': 0.42} + 42%|████▏ | 650/1563 [07:13<10:52, 1.40it/s] 42%|████▏ | 651/1563 [07:14<09:55, 1.53it/s] 42%|████▏ | 652/1563 [07:14<08:53, 1.71it/s] 42%|████▏ | 653/1563 [07:15<10:06, 1.50it/s] 42%|████▏ | 654/1563 [07:15<09:14, 1.64it/s] 42%|████▏ | 655/1563 [07:16<08:38, 1.75it/s] 42%|████▏ | 656/1563 
[07:17<09:55, 1.52it/s] 42%|████▏ | 657/1563 [07:17<10:19, 1.46it/s] 42%|████▏ | 658/1563 [07:18<11:02, 1.37it/s] 42%|████▏ | 659/1563 [07:19<11:41, 1.29it/s] 42%|████▏ | 660/1563 [07:20<10:26, 1.44it/s] 42%|████▏ | 661/1563 [07:20<10:40, 1.41it/s] 42%|████▏ | 662/1563 [07:21<09:19, 1.61it/s] 42%|████▏ | 663/1563 [07:21<09:34, 1.57it/s] 42%|████▏ | 664/1563 [07:22<09:48, 1.53it/s] 43%|████▎ | 665/1563 [07:23<09:10, 1.63it/s] 43%|████▎ | 666/1563 [07:23<08:53, 1.68it/s] 43%|████▎ | 667/1563 [07:24<08:22, 1.78it/s] 43%|████▎ | 668/1563 [07:24<09:24, 1.59it/s] 43%|████▎ | 669/1563 [07:25<08:44, 1.70it/s] 43%|████▎ | 670/1563 [07:25<08:02, 1.85it/s] 43%|████▎ | 671/1563 [07:26<07:39, 1.94it/s] 43%|████▎ | 672/1563 [07:26<07:15, 2.04it/s] 43%|████▎ | 673/1563 [07:27<07:24, 2.00it/s] 43%|████▎ | 674/1563 [07:28<08:56, 1.66it/s] 43%|████▎ | 675/1563 [07:28<09:45, 1.52it/s] 43%|████▎ | 676/1563 [07:29<09:01, 1.64it/s] 43%|████▎ | 677/1563 [07:30<09:13, 1.60it/s] 43%|████▎ | 678/1563 [07:30<08:19, 1.77it/s] 43%|████▎ | 679/1563 [07:31<08:45, 1.68it/s] 44%|████▎ | 680/1563 [07:31<09:09, 1.61it/s] 44%|████▎ | 681/1563 [07:32<08:09, 1.80it/s] 44%|████▎ | 682/1563 [07:32<07:43, 1.90it/s] 44%|████▎ | 683/1563 [07:33<08:21, 1.76it/s] 44%|████▍ | 684/1563 [07:34<08:38, 1.69it/s] 44%|████▍ | 685/1563 [07:34<09:01, 1.62it/s] 44%|████▍ | 686/1563 [07:35<08:16, 1.77it/s] 44%|████▍ | 687/1563 [07:35<08:04, 1.81it/s] 44%|████▍ | 688/1563 [07:36<08:43, 1.67it/s] 44%|████▍ | 689/1563 [07:37<09:25, 1.55it/s] 44%|████▍ | 690/1563 [07:38<10:21, 1.41it/s] 44%|████▍ | 691/1563 [07:38<10:58, 1.32it/s] 44%|████▍ | 692/1563 [07:39<09:57, 1.46it/s] 44%|████▍ | 693/1563 [07:39<09:03, 1.60it/s] 44%|████▍ | 694/1563 [07:40<08:21, 1.73it/s] 44%|████▍ | 695/1563 [07:40<08:14, 1.76it/s] 45%|████▍ | 696/1563 [07:41<07:46, 1.86it/s] 45%|████▍ | 697/1563 [07:41<07:29, 1.93it/s] 45%|████▍ | 698/1563 [07:42<07:40, 1.88it/s] 45%|████▍ | 699/1563 [07:42<07:29, 1.92it/s] 45%|████▍ | 700/1563 [07:43<07:07, 
2.02it/s] {'loss': 0.1618, 'grad_norm': 16.625, 'learning_rate': 1.105566218809981e-05, 'epoch': 0.45} + 45%|████▍ | 700/1563 [07:43<07:07, 2.02it/s] 45%|████▍ | 701/1563 [07:44<08:45, 1.64it/s] 45%|████▍ | 702/1563 [07:44<08:32, 1.68it/s] 45%|████▍ | 703/1563 [07:45<07:54, 1.81it/s] 45%|████▌ | 704/1563 [07:45<08:53, 1.61it/s] 45%|████▌ | 705/1563 [07:46<08:14, 1.73it/s] 45%|████▌ | 706/1563 [07:46<07:43, 1.85it/s] 45%|████▌ | 707/1563 [07:47<08:44, 1.63it/s] 45%|████▌ | 708/1563 [07:48<08:04, 1.76it/s] 45%|████▌ | 709/1563 [07:48<07:37, 1.87it/s] 45%|████▌ | 710/1563 [07:49<08:14, 1.72it/s] 45%|████▌ | 711/1563 [07:50<08:49, 1.61it/s] 46%|████▌ | 712/1563 [07:50<08:43, 1.63it/s] 46%|████▌ | 713/1563 [07:51<09:29, 1.49it/s] 46%|████▌ | 714/1563 [07:51<08:29, 1.67it/s] 46%|████▌ | 715/1563 [07:52<08:45, 1.61it/s] 46%|████▌ | 716/1563 [07:52<08:01, 1.76it/s] 46%|████▌ | 717/1563 [07:53<09:02, 1.56it/s] 46%|████▌ | 718/1563 [07:54<08:42, 1.62it/s] 46%|████▌ | 719/1563 [07:54<08:00, 1.76it/s] 46%|████▌ | 720/1563 [07:55<07:51, 1.79it/s] 46%|████▌ | 721/1563 [07:56<08:49, 1.59it/s] 46%|████▌ | 722/1563 [07:56<09:32, 1.47it/s] 46%|████▋ | 723/1563 [07:57<10:14, 1.37it/s] 46%|████▋ | 724/1563 [07:58<09:01, 1.55it/s] 46%|████▋ | 725/1563 [07:59<09:51, 1.42it/s] 46%|████▋ | 726/1563 [07:59<09:37, 1.45it/s] 47%|████▋ | 727/1563 [08:00<09:37, 1.45it/s] 47%|████▋ | 728/1563 [08:00<08:38, 1.61it/s] 47%|████▋ | 729/1563 [08:01<08:54, 1.56it/s] 47%|████▋ | 730/1563 [08:02<09:17, 1.49it/s] 47%|████▋ | 731/1563 [08:02<08:39, 1.60it/s] 47%|████▋ | 732/1563 [08:03<09:29, 1.46it/s] 47%|████▋ | 733/1563 [08:04<08:56, 1.55it/s] 47%|████▋ | 734/1563 [08:04<08:14, 1.68it/s] 47%|████▋ | 735/1563 [08:05<09:16, 1.49it/s] 47%|████▋ | 736/1563 [08:06<08:31, 1.62it/s] 47%|████▋ | 737/1563 [08:06<09:10, 1.50it/s] 47%|████▋ | 738/1563 [08:07<08:32, 1.61it/s] 47%|████▋ | 739/1563 [08:07<08:03, 1.70it/s] 47%|████▋ | 740/1563 [08:08<08:58, 1.53it/s] 47%|████▋ | 741/1563 [08:09<09:29, 1.44it/s] 
47%|████▋ | 742/1563 [08:10<09:52, 1.38it/s] 48%|████▊ | 743/1563 [08:11<10:29, 1.30it/s] 48%|████▊ | 744/1563 [08:11<09:26, 1.45it/s] 48%|████▊ | 745/1563 [08:12<10:10, 1.34it/s] 48%|████▊ | 746/1563 [08:12<09:01, 1.51it/s] 48%|████▊ | 747/1563 [08:13<09:05, 1.50it/s] 48%|████▊ | 748/1563 [08:14<09:55, 1.37it/s] 48%|████▊ | 749/1563 [08:14<08:51, 1.53it/s] 48%|████▊ | 750/1563 [08:15<08:16, 1.64it/s] {'loss': 0.1435, 'grad_norm': 16.0, 'learning_rate': 1.0415866922584774e-05, 'epoch': 0.48} + 48%|████▊ | 750/1563 [08:15<08:16, 1.64it/s] 48%|████▊ | 751/1563 [08:16<08:51, 1.53it/s] 48%|████▊ | 752/1563 [08:16<07:55, 1.70it/s] 48%|████▊ | 753/1563 [08:17<08:10, 1.65it/s] 48%|████▊ | 754/1563 [08:18<09:03, 1.49it/s] 48%|████▊ | 755/1563 [08:18<09:04, 1.48it/s] 48%|████▊ | 756/1563 [08:19<09:52, 1.36it/s] 48%|████▊ | 757/1563 [08:20<08:52, 1.51it/s] 48%|████▊ | 758/1563 [08:20<09:10, 1.46it/s] 49%|████▊ | 759/1563 [08:21<09:37, 1.39it/s] 49%|████▊ | 760/1563 [08:22<08:37, 1.55it/s] 49%|████▊ | 761/1563 [08:23<09:24, 1.42it/s] 49%|████▉ | 762/1563 [08:23<08:16, 1.61it/s] 49%|████▉ | 763/1563 [08:23<07:41, 1.73it/s] 49%|████▉ | 764/1563 [08:24<08:29, 1.57it/s] 49%|████▉ | 765/1563 [08:25<09:21, 1.42it/s] 49%|████▉ | 766/1563 [08:26<09:58, 1.33it/s] 49%|████▉ | 767/1563 [08:27<10:19, 1.28it/s] 49%|████▉ | 768/1563 [08:27<09:46, 1.36it/s] 49%|████▉ | 769/1563 [08:28<08:46, 1.51it/s] 49%|████▉ | 770/1563 [08:28<07:57, 1.66it/s] 49%|████▉ | 771/1563 [08:29<08:52, 1.49it/s] 49%|████▉ | 772/1563 [08:30<09:05, 1.45it/s] 49%|████▉ | 773/1563 [08:31<09:27, 1.39it/s] 50%|████▉ | 774/1563 [08:31<09:18, 1.41it/s] 50%|████▉ | 775/1563 [08:32<08:19, 1.58it/s] 50%|████▉ | 776/1563 [08:33<08:47, 1.49it/s] 50%|████▉ | 777/1563 [08:33<08:36, 1.52it/s] 50%|████▉ | 778/1563 [08:34<08:55, 1.47it/s] 50%|████▉ | 779/1563 [08:35<09:34, 1.36it/s] 50%|████▉ | 780/1563 [08:35<08:46, 1.49it/s] 50%|████▉ | 781/1563 [08:36<08:51, 1.47it/s] 50%|█████ | 782/1563 [08:37<07:59, 1.63it/s] 50%|█████ | 
783/1563 [08:37<07:39, 1.70it/s] 50%|█████ | 784/1563 [08:38<07:41, 1.69it/s] 50%|█████ | 785/1563 [08:38<08:01, 1.62it/s] 50%|█████ | 786/1563 [08:39<07:24, 1.75it/s] 50%|█████ | 787/1563 [08:40<08:15, 1.57it/s] 50%|█████ | 788/1563 [08:40<07:32, 1.71it/s] 50%|█████ | 789/1563 [08:41<07:52, 1.64it/s] 51%|█████ | 790/1563 [08:41<07:15, 1.77it/s] 51%|█████ | 791/1563 [08:42<07:17, 1.77it/s] 51%|█████ | 792/1563 [08:42<07:30, 1.71it/s] 51%|█████ | 793/1563 [08:43<06:54, 1.86it/s] 51%|█████ | 794/1563 [08:44<07:46, 1.65it/s] 51%|█████ | 795/1563 [08:44<07:07, 1.80it/s] 51%|█████ | 796/1563 [08:45<08:13, 1.55it/s] 51%|█████ | 797/1563 [08:46<09:02, 1.41it/s] 51%|█████ | 798/1563 [08:46<08:12, 1.55it/s] 51%|█████ | 799/1563 [08:47<09:00, 1.41it/s] 51%|█████ | 800/1563 [08:48<09:27, 1.34it/s] {'loss': 0.1442, 'grad_norm': 1.359375, 'learning_rate': 9.776071657069739e-06, 'epoch': 0.51} + 51%|█████ | 800/1563 [08:48<09:27, 1.34it/s] 51%|█████ | 801/1563 [08:49<09:30, 1.34it/s] 51%|█████▏ | 802/1563 [08:49<09:12, 1.38it/s] 51%|█████▏ | 803/1563 [08:50<08:01, 1.58it/s] 51%|█████▏ | 804/1563 [08:51<08:58, 1.41it/s] 52%|█████▏ | 805/1563 [08:51<09:24, 1.34it/s] 52%|█████▏ | 806/1563 [08:52<09:22, 1.35it/s] 52%|█████▏ | 807/1563 [08:53<08:33, 1.47it/s] 52%|█████▏ | 808/1563 [08:53<07:47, 1.61it/s] 52%|█████▏ | 809/1563 [08:54<08:37, 1.46it/s] 52%|█████▏ | 810/1563 [08:55<09:07, 1.38it/s] 52%|█████▏ | 811/1563 [08:55<07:59, 1.57it/s] 52%|█████▏ | 812/1563 [08:56<08:13, 1.52it/s] 52%|█████▏ | 813/1563 [08:57<08:36, 1.45it/s] 52%|█████▏ | 814/1563 [08:57<08:31, 1.46it/s] 52%|█████▏ | 815/1563 [08:58<09:05, 1.37it/s] 52%|█████▏ | 816/1563 [08:59<09:21, 1.33it/s] 52%|█████▏ | 817/1563 [09:00<09:04, 1.37it/s] 52%|█████▏ | 818/1563 [09:00<08:58, 1.38it/s] 52%|█████▏ | 819/1563 [09:01<07:51, 1.58it/s] 52%|█████▏ | 820/1563 [09:01<07:42, 1.61it/s] 53%|█████▎ | 821/1563 [09:02<07:15, 1.70it/s] 53%|█████▎ | 822/1563 [09:02<06:44, 1.83it/s] 53%|█████▎ | 823/1563 [09:03<07:52, 1.57it/s] 
53%|█████▎ | 824/1563 [09:04<08:42, 1.41it/s] 53%|█████▎ | 825/1563 [09:05<08:46, 1.40it/s] 53%|█████▎ | 826/1563 [09:05<08:17, 1.48it/s] 53%|█████▎ | 827/1563 [09:06<07:30, 1.64it/s] 53%|█████▎ | 828/1563 [09:07<08:13, 1.49it/s] 53%|█████▎ | 829/1563 [09:08<08:40, 1.41it/s] 53%|█████▎ | 830/1563 [09:08<08:19, 1.47it/s] 53%|█████▎ | 831/1563 [09:09<08:19, 1.46it/s] 53%|█████▎ | 832/1563 [09:09<07:37, 1.60it/s] 53%|█████▎ | 833/1563 [09:10<07:50, 1.55it/s] 53%|█████▎ | 834/1563 [09:10<07:03, 1.72it/s] 53%|█████▎ | 835/1563 [09:11<07:14, 1.68it/s] 53%|█████▎ | 836/1563 [09:12<08:08, 1.49it/s] 54%|█████▎ | 837/1563 [09:13<08:08, 1.49it/s] 54%|█████▎ | 838/1563 [09:13<08:21, 1.44it/s] 54%|█████▎ | 839/1563 [09:14<08:51, 1.36it/s] 54%|█████▎ | 840/1563 [09:15<08:45, 1.37it/s] 54%|█████▍ | 841/1563 [09:16<09:04, 1.33it/s] 54%|█████▍ | 842/1563 [09:17<09:27, 1.27it/s] 54%|█████▍ | 843/1563 [09:17<09:41, 1.24it/s] 54%|█████▍ | 844/1563 [09:18<09:18, 1.29it/s] 54%|█████▍ | 845/1563 [09:19<08:53, 1.35it/s] 54%|█████▍ | 846/1563 [09:19<08:22, 1.43it/s] 54%|█████▍ | 847/1563 [09:20<07:29, 1.59it/s] 54%|█████▍ | 848/1563 [09:21<08:12, 1.45it/s] 54%|█████▍ | 849/1563 [09:22<08:42, 1.37it/s] 54%|█████▍ | 850/1563 [09:22<07:55, 1.50it/s] {'loss': 0.1416, 'grad_norm': 1.0859375, 'learning_rate': 9.136276391554704e-06, 'epoch': 0.54} + 54%|█████▍ | 850/1563 [09:22<07:55, 1.50it/s] 54%|█████▍ | 851/1563 [09:23<08:05, 1.47it/s] 55%|█████▍ | 852/1563 [09:23<07:20, 1.62it/s] 55%|█████▍ | 853/1563 [09:24<07:00, 1.69it/s] 55%|█████▍ | 854/1563 [09:24<06:31, 1.81it/s] 55%|█████▍ | 855/1563 [09:25<07:20, 1.61it/s] 55%|█████▍ | 856/1563 [09:26<07:03, 1.67it/s] 55%|█████▍ | 857/1563 [09:26<06:33, 1.79it/s] 55%|█████▍ | 858/1563 [09:27<07:33, 1.55it/s] 55%|█████▍ | 859/1563 [09:28<07:41, 1.53it/s] 55%|█████▌ | 860/1563 [09:28<07:04, 1.66it/s] 55%|█████▌ | 861/1563 [09:29<07:14, 1.62it/s] 55%|█████▌ | 862/1563 [09:29<06:28, 1.81it/s] 55%|█████▌ | 863/1563 [09:30<06:04, 1.92it/s] 55%|█████▌ | 
864/1563 [09:30<06:33, 1.78it/s] 55%|█████▌ | 865/1563 [09:31<07:24, 1.57it/s] 55%|█████▌ | 866/1563 [09:31<06:43, 1.73it/s] 55%|█████▌ | 867/1563 [09:32<07:26, 1.56it/s] 56%|█████▌ | 868/1563 [09:33<07:24, 1.56it/s] 56%|█████▌ | 869/1563 [09:33<06:48, 1.70it/s] 56%|█████▌ | 870/1563 [09:34<06:22, 1.81it/s] 56%|█████▌ | 871/1563 [09:34<06:05, 1.89it/s] 56%|█████▌ | 872/1563 [09:35<07:09, 1.61it/s] 56%|█████▌ | 873/1563 [09:36<06:40, 1.72it/s] 56%|█████▌ | 874/1563 [09:36<06:54, 1.66it/s] 56%|█████▌ | 875/1563 [09:37<07:20, 1.56it/s] 56%|█████▌ | 876/1563 [09:38<07:35, 1.51it/s] 56%|█████▌ | 877/1563 [09:38<07:23, 1.55it/s] 56%|█████▌ | 878/1563 [09:39<06:46, 1.68it/s] 56%|█████▌ | 879/1563 [09:39<07:08, 1.60it/s] 56%|█████▋ | 880/1563 [09:40<07:45, 1.47it/s] 56%|█████▋ | 881/1563 [09:41<06:54, 1.65it/s] 56%|█████▋ | 882/1563 [09:42<07:47, 1.46it/s] 56%|█████▋ | 883/1563 [09:42<07:02, 1.61it/s] 57%|█████▋ | 884/1563 [09:43<07:49, 1.45it/s] 57%|█████▋ | 885/1563 [09:44<07:44, 1.46it/s] 57%|█████▋ | 886/1563 [09:44<07:30, 1.50it/s] 57%|█████▋ | 887/1563 [09:45<06:42, 1.68it/s] 57%|█████▋ | 888/1563 [09:45<07:03, 1.59it/s] 57%|█████▋ | 889/1563 [09:46<07:25, 1.51it/s] 57%|█████▋ | 890/1563 [09:47<07:57, 1.41it/s] 57%|█████▋ | 891/1563 [09:47<07:22, 1.52it/s] 57%|█████▋ | 892/1563 [09:48<06:45, 1.66it/s] 57%|█████▋ | 893/1563 [09:48<06:24, 1.74it/s] 57%|█████▋ | 894/1563 [09:49<06:08, 1.82it/s] 57%|█████▋ | 895/1563 [09:50<07:10, 1.55it/s] 57%|█████▋ | 896/1563 [09:50<06:50, 1.63it/s] 57%|█████▋ | 897/1563 [09:51<06:38, 1.67it/s] 57%|█████▋ | 898/1563 [09:51<06:20, 1.75it/s] 58%|█████▊ | 899/1563 [09:52<06:25, 1.72it/s] 58%|█████▊ | 900/1563 [09:52<06:03, 1.83it/s] {'loss': 0.1478, 'grad_norm': 5.09375, 'learning_rate': 8.496481126039668e-06, 'epoch': 0.58} + 58%|█████▊ | 900/1563 [09:53<06:03, 1.83it/s] 58%|█████▊ | 901/1563 [09:53<05:52, 1.88it/s] 58%|█████▊ | 902/1563 [09:54<06:17, 1.75it/s] 58%|█████▊ | 903/1563 [09:54<06:01, 1.83it/s] 58%|█████▊ | 904/1563 
[09:55<07:03, 1.56it/s] 58%|█████▊ | 905/1563 [09:55<06:29, 1.69it/s] 58%|█████▊ | 906/1563 [09:56<06:09, 1.78it/s] 58%|█████▊ | 907/1563 [09:57<06:46, 1.61it/s] 58%|█████▊ | 908/1563 [09:57<06:19, 1.73it/s] 58%|█████▊ | 909/1563 [09:58<06:40, 1.63it/s] 58%|█████▊ | 910/1563 [09:58<06:14, 1.75it/s] 58%|█████▊ | 911/1563 [09:59<06:05, 1.78it/s] 58%|█████▊ | 912/1563 [10:00<06:28, 1.68it/s] 58%|█████▊ | 913/1563 [10:00<06:12, 1.74it/s] 58%|█████▊ | 914/1563 [10:01<06:31, 1.66it/s] 59%|█████▊ | 915/1563 [10:01<06:19, 1.71it/s] 59%|█████▊ | 916/1563 [10:02<06:11, 1.74it/s] 59%|█████▊ | 917/1563 [10:02<05:45, 1.87it/s] 59%|█████▊ | 918/1563 [10:03<05:36, 1.92it/s] 59%|█████▉ | 919/1563 [10:04<06:17, 1.70it/s] 59%|█████▉ | 920/1563 [10:04<06:56, 1.55it/s] 59%|█████▉ | 921/1563 [10:05<07:02, 1.52it/s] 59%|█████▉ | 922/1563 [10:06<07:04, 1.51it/s] 59%|█████▉ | 923/1563 [10:06<07:23, 1.44it/s] 59%|█████▉ | 924/1563 [10:07<07:12, 1.48it/s] 59%|█████▉ | 925/1563 [10:08<06:29, 1.64it/s] 59%|█████▉ | 926/1563 [10:08<07:05, 1.50it/s] 59%|█████▉ | 927/1563 [10:09<07:20, 1.44it/s] 59%|█████▉ | 928/1563 [10:10<06:42, 1.58it/s] 59%|█████▉ | 929/1563 [10:10<06:10, 1.71it/s] 60%|█████▉ | 930/1563 [10:10<05:44, 1.84it/s] 60%|█████▉ | 931/1563 [10:11<05:30, 1.91it/s] 60%|█████▉ | 932/1563 [10:11<05:13, 2.01it/s] 60%|█████▉ | 933/1563 [10:12<05:47, 1.81it/s] 60%|█████▉ | 934/1563 [10:13<06:41, 1.56it/s] 60%|█████▉ | 935/1563 [10:13<06:17, 1.66it/s] 60%|█████▉ | 936/1563 [10:14<07:04, 1.48it/s] 60%|█████▉ | 937/1563 [10:15<06:19, 1.65it/s] 60%|██████ | 938/1563 [10:16<06:57, 1.50it/s] 60%|██████ | 939/1563 [10:16<07:07, 1.46it/s] 60%|██████ | 940/1563 [10:17<06:38, 1.56it/s] 60%|██████ | 941/1563 [10:17<05:58, 1.73it/s] 60%|██████ | 942/1563 [10:18<06:34, 1.57it/s] 60%|██████ | 943/1563 [10:19<06:54, 1.50it/s] 60%|██████ | 944/1563 [10:19<06:49, 1.51it/s] 60%|██████ | 945/1563 [10:20<06:18, 1.63it/s] 61%|██████ | 946/1563 [10:20<05:44, 1.79it/s] 61%|██████ | 947/1563 [10:21<06:04, 
1.69it/s] 61%|██████ | 948/1563 [10:21<05:31, 1.86it/s] 61%|██████ | 949/1563 [10:22<05:19, 1.92it/s] 61%|██████ | 950/1563 [10:22<05:15, 1.94it/s] {'loss': 0.1462, 'grad_norm': 27.75, 'learning_rate': 7.856685860524633e-06, 'epoch': 0.61} + 61%|██████ | 950/1563 [10:22<05:15, 1.94it/s] 61%|██████ | 951/1563 [10:23<06:16, 1.63it/s] 61%|██████ | 952/1563 [10:24<05:56, 1.71it/s] 61%|██████ | 953/1563 [10:24<05:36, 1.81it/s] 61%|██████ | 954/1563 [10:25<05:29, 1.85it/s] 61%|██████ | 955/1563 [10:26<06:08, 1.65it/s] 61%|██████ | 956/1563 [10:26<06:20, 1.59it/s] 61%|██████ | 957/1563 [10:27<06:33, 1.54it/s] 61%|██████▏ | 958/1563 [10:27<06:14, 1.62it/s] 61%|██████▏ | 959/1563 [10:28<06:08, 1.64it/s] 61%|██████▏ | 960/1563 [10:29<06:36, 1.52it/s] 61%|██████▏ | 961/1563 [10:29<06:41, 1.50it/s] 62%|██████▏ | 962/1563 [10:30<06:36, 1.52it/s] 62%|██████▏ | 963/1563 [10:31<06:15, 1.60it/s] 62%|██████▏ | 964/1563 [10:31<06:36, 1.51it/s] 62%|██████▏ | 965/1563 [10:32<06:22, 1.56it/s] 62%|██████▏ | 966/1563 [10:33<06:15, 1.59it/s] 62%|██████▏ | 967/1563 [10:33<06:32, 1.52it/s] 62%|██████▏ | 968/1563 [10:34<05:59, 1.65it/s] 62%|██████▏ | 969/1563 [10:35<06:41, 1.48it/s] 62%|██████▏ | 970/1563 [10:35<06:04, 1.63it/s] 62%|██████▏ | 971/1563 [10:36<06:34, 1.50it/s] 62%|██████▏ | 972/1563 [10:36<06:05, 1.62it/s] 62%|██████▏ | 973/1563 [10:37<06:50, 1.44it/s] 62%|██████▏ | 974/1563 [10:38<06:47, 1.45it/s] 62%|██████▏ | 975/1563 [10:39<07:15, 1.35it/s] 62%|██████▏ | 976/1563 [10:40<07:29, 1.30it/s] 63%|██████▎ | 977/1563 [10:40<07:16, 1.34it/s] 63%|██████▎ | 978/1563 [10:41<07:03, 1.38it/s] 63%|██████▎ | 979/1563 [10:42<07:03, 1.38it/s] 63%|██████▎ | 980/1563 [10:43<07:27, 1.30it/s] 63%|██████▎ | 981/1563 [10:43<07:39, 1.27it/s] 63%|██████▎ | 982/1563 [10:44<07:41, 1.26it/s] 63%|██████▎ | 983/1563 [10:45<07:51, 1.23it/s] 63%|██████▎ | 984/1563 [10:46<07:00, 1.38it/s] 63%|██████▎ | 985/1563 [10:46<06:14, 1.54it/s] 63%|██████▎ | 986/1563 [10:47<06:52, 1.40it/s] 63%|██████▎ | 987/1563 
[10:48<06:53, 1.39it/s] 63%|██████▎ | 988/1563 [10:48<06:53, 1.39it/s] 63%|██████▎ | 989/1563 [10:49<06:43, 1.42it/s] 63%|██████▎ | 990/1563 [10:50<06:22, 1.50it/s] 63%|██████▎ | 991/1563 [10:50<06:46, 1.41it/s] 63%|██████▎ | 992/1563 [10:51<06:14, 1.53it/s] 64%|██████▎ | 993/1563 [10:52<06:04, 1.56it/s] 64%|██████▎ | 994/1563 [10:52<06:06, 1.55it/s] 64%|██████▎ | 995/1563 [10:53<05:51, 1.62it/s] 64%|██████▎ | 996/1563 [10:53<05:33, 1.70it/s] 64%|██████▍ | 997/1563 [10:54<05:15, 1.80it/s] 64%|██████▍ | 998/1563 [10:55<06:05, 1.54it/s] 64%|██████▍ | 999/1563 [10:55<05:53, 1.60it/s] 64%|██████▍ | 1000/1563 [10:56<06:25, 1.46it/s] {'loss': 0.1499, 'grad_norm': 0.921875, 'learning_rate': 7.216890595009598e-06, 'epoch': 0.64} + 64%|██████▍ | 1000/1563 [10:56<06:25, 1.46it/s] 64%|██████▍ | 1001/1563 [10:57<06:18, 1.48it/s] 64%|██████▍ | 1002/1563 [10:57<06:20, 1.47it/s] 64%|██████▍ | 1003/1563 [10:58<06:31, 1.43it/s] 64%|██████▍ | 1004/1563 [10:59<06:33, 1.42it/s] 64%|██████▍ | 1005/1563 [10:59<06:01, 1.54it/s] 64%|██████▍ | 1006/1563 [11:00<06:42, 1.39it/s] 64%|██████▍ | 1007/1563 [11:01<07:06, 1.31it/s] 64%|██████▍ | 1008/1563 [11:02<07:16, 1.27it/s] 65%|██████▍ | 1009/1563 [11:03<06:51, 1.34it/s] 65%|██████▍ | 1010/1563 [11:04<07:09, 1.29it/s] 65%|██████▍ | 1011/1563 [11:04<07:22, 1.25it/s] 65%|██████▍ | 1012/1563 [11:05<06:26, 1.43it/s] 65%|██████▍ | 1013/1563 [11:06<06:29, 1.41it/s] 65%|██████▍ | 1014/1563 [11:06<05:42, 1.60it/s] 65%|██████▍ | 1015/1563 [11:06<05:21, 1.70it/s] 65%|██████▌ | 1016/1563 [11:07<05:07, 1.78it/s] 65%|██████▌ | 1017/1563 [11:07<04:50, 1.88it/s] 65%|██████▌ | 1018/1563 [11:08<05:20, 1.70it/s] 65%|██████▌ | 1019/1563 [11:09<05:31, 1.64it/s] 65%|██████▌ | 1020/1563 [11:09<05:06, 1.77it/s] 65%|██████▌ | 1021/1563 [11:10<05:30, 1.64it/s] 65%|██████▌ | 1022/1563 [11:11<05:14, 1.72it/s] 65%|██████▌ | 1023/1563 [11:11<04:50, 1.86it/s] 66%|██████▌ | 1024/1563 [11:11<04:45, 1.89it/s] 66%|██████▌ | 1025/1563 [11:12<05:09, 1.74it/s] 66%|██████▌ | 
1026/1563 [11:13<04:52, 1.83it/s] 66%|██████▌ | 1027/1563 [11:13<05:37, 1.59it/s] 66%|██████▌ | 1028/1563 [11:14<06:13, 1.43it/s] 66%|██████▌ | 1029/1563 [11:15<06:31, 1.36it/s] 66%|██████▌ | 1030/1563 [11:16<06:24, 1.38it/s] 66%|██████▌ | 1031/1563 [11:17<06:19, 1.40it/s] 66%|██████▌ | 1032/1563 [11:17<05:35, 1.58it/s] 66%|██████▌ | 1033/1563 [11:18<05:39, 1.56it/s] 66%|██████▌ | 1034/1563 [11:18<06:15, 1.41it/s] 66%|██████▌ | 1035/1563 [11:19<06:14, 1.41it/s] 66%|██████▋ | 1036/1563 [11:20<05:33, 1.58it/s] 66%|██████▋ | 1037/1563 [11:20<05:44, 1.53it/s] 66%|██████▋ | 1038/1563 [11:21<06:09, 1.42it/s] 66%|██████▋ | 1039/1563 [11:22<05:44, 1.52it/s] 67%|██████▋ | 1040/1563 [11:23<06:15, 1.39it/s] 67%|██████▋ | 1041/1563 [11:23<05:40, 1.53it/s] 67%|██████▋ | 1042/1563 [11:24<06:04, 1.43it/s] 67%|██████▋ | 1043/1563 [11:24<05:33, 1.56it/s] 67%|██████▋ | 1044/1563 [11:25<06:07, 1.41it/s] 67%|██████▋ | 1045/1563 [11:26<06:12, 1.39it/s] 67%|██████▋ | 1046/1563 [11:26<05:27, 1.58it/s] 67%|██████▋ | 1047/1563 [11:27<05:57, 1.44it/s] 67%|██████▋ | 1048/1563 [11:28<05:17, 1.62it/s] 67%|██████▋ | 1049/1563 [11:28<04:51, 1.77it/s] 67%|██████▋ | 1050/1563 [11:29<04:28, 1.91it/s] {'loss': 0.1438, 'grad_norm': 13.25, 'learning_rate': 6.577095329494563e-06, 'epoch': 0.67} + 67%|██████▋ | 1050/1563 [11:29<04:28, 1.91it/s] 67%|██████▋ | 1051/1563 [11:29<04:18, 1.98it/s] 67%|██████▋ | 1052/1563 [11:30<04:14, 2.01it/s] 67%|██████▋ | 1053/1563 [11:30<05:19, 1.60it/s] 67%|██████▋ | 1054/1563 [11:31<05:51, 1.45it/s] 67%|██████▋ | 1055/1563 [11:32<06:14, 1.36it/s] 68%|██████▊ | 1056/1563 [11:33<06:18, 1.34it/s] 68%|██████▊ | 1057/1563 [11:34<06:13, 1.35it/s] 68%|██████▊ | 1058/1563 [11:34<06:28, 1.30it/s] 68%|██████▊ | 1059/1563 [11:35<06:41, 1.26it/s] 68%|██████▊ | 1060/1563 [11:36<06:28, 1.29it/s] 68%|██████▊ | 1061/1563 [11:37<06:31, 1.28it/s] 68%|██████▊ | 1062/1563 [11:38<06:34, 1.27it/s] 68%|██████▊ | 1063/1563 [11:38<06:37, 1.26it/s] 68%|██████▊ | 1064/1563 [11:39<05:58, 1.39it/s] 
68%|██████▊ | 1065/1563 [11:40<06:18, 1.31it/s] 68%|██████▊ | 1066/1563 [11:40<05:30, 1.50it/s] 68%|██████▊ | 1067/1563 [11:41<05:45, 1.44it/s] 68%|██████▊ | 1068/1563 [11:42<05:45, 1.43it/s] 68%|██████▊ | 1069/1563 [11:43<06:04, 1.36it/s] 68%|██████▊ | 1070/1563 [11:43<06:20, 1.29it/s] 69%|██████▊ | 1071/1563 [11:44<05:32, 1.48it/s] 69%|██████▊ | 1072/1563 [11:45<05:59, 1.36it/s] 69%|██████▊ | 1073/1563 [11:46<06:18, 1.30it/s] 69%|██████▊ | 1074/1563 [11:46<05:28, 1.49it/s] 69%|██████▉ | 1075/1563 [11:47<05:52, 1.38it/s] 69%|██████▉ | 1076/1563 [11:48<06:06, 1.33it/s] 69%|██████▉ | 1077/1563 [11:48<05:45, 1.41it/s] 69%|██████▉ | 1078/1563 [11:49<05:37, 1.44it/s] 69%|██████▉ | 1079/1563 [11:50<05:55, 1.36it/s] 69%|██████▉ | 1080/1563 [11:50<05:31, 1.46it/s] 69%|██████▉ | 1081/1563 [11:51<05:58, 1.34it/s] 69%|██████▉ | 1082/1563 [11:52<06:16, 1.28it/s] 69%|██████▉ | 1083/1563 [11:53<05:57, 1.34it/s] 69%|██████▉ | 1084/1563 [11:54<06:07, 1.30it/s] 69%|██████▉ | 1085/1563 [11:54<05:44, 1.39it/s] 69%|██████▉ | 1086/1563 [11:55<05:18, 1.50it/s] 70%|██████▉ | 1087/1563 [11:55<05:18, 1.50it/s] 70%|██████▉ | 1088/1563 [11:56<04:56, 1.60it/s] 70%|██████▉ | 1089/1563 [11:56<04:36, 1.72it/s] 70%|██████▉ | 1090/1563 [11:57<05:12, 1.51it/s] 70%|██████▉ | 1091/1563 [11:58<05:40, 1.39it/s] 70%|██████▉ | 1092/1563 [11:59<05:11, 1.51it/s] 70%|██████▉ | 1093/1563 [11:59<05:08, 1.52it/s] 70%|██████▉ | 1094/1563 [12:00<05:35, 1.40it/s] 70%|███████ | 1095/1563 [12:01<04:58, 1.57it/s] 70%|███████ | 1096/1563 [12:01<04:30, 1.73it/s] 70%|███████ | 1097/1563 [12:02<04:39, 1.67it/s] 70%|███████ | 1098/1563 [12:03<05:12, 1.49it/s] 70%|███████ | 1099/1563 [12:03<05:29, 1.41it/s] 70%|███████ | 1100/1563 [12:04<05:33, 1.39it/s] {'loss': 0.1419, 'grad_norm': 9.1875, 'learning_rate': 5.937300063979527e-06, 'epoch': 0.7} + 70%|███████ | 1100/1563 [12:04<05:33, 1.39it/s] 70%|███████ | 1101/1563 [12:05<05:09, 1.49it/s] 71%|███████ | 1102/1563 [12:06<05:32, 1.39it/s] 71%|███████ | 1103/1563 
[12:06<05:05, 1.51it/s] 71%|███████ | 1104/1563 [12:07<04:39, 1.64it/s] 71%|███████ | 1105/1563 [12:07<04:25, 1.72it/s] 71%|███████ | 1106/1563 [12:08<05:03, 1.51it/s] 71%|███████ | 1107/1563 [12:09<05:04, 1.50it/s] 71%|███████ | 1108/1563 [12:09<05:17, 1.43it/s] 71%|███████ | 1109/1563 [12:10<05:40, 1.33it/s] 71%|███████ | 1110/1563 [12:11<04:53, 1.54it/s] 71%|███████ | 1111/1563 [12:11<05:16, 1.43it/s] 71%|███████ | 1112/1563 [12:12<04:45, 1.58it/s] 71%|███████ | 1113/1563 [12:13<05:12, 1.44it/s] 71%|███████▏ | 1114/1563 [12:14<05:22, 1.39it/s] 71%|███████▏ | 1115/1563 [12:14<05:36, 1.33it/s] 71%|███████▏ | 1116/1563 [12:15<05:01, 1.48it/s] 71%|███████▏ | 1117/1563 [12:16<05:16, 1.41it/s] 72%|███████▏ | 1118/1563 [12:16<04:58, 1.49it/s] 72%|███████▏ | 1119/1563 [12:17<05:19, 1.39it/s] 72%|███████▏ | 1120/1563 [12:18<05:27, 1.35it/s] 72%|███████▏ | 1121/1563 [12:19<05:20, 1.38it/s] 72%|███████▏ | 1122/1563 [12:19<04:50, 1.52it/s] 72%|███████▏ | 1123/1563 [12:20<05:14, 1.40it/s] 72%|███████▏ | 1124/1563 [12:21<05:25, 1.35it/s] 72%|███████▏ | 1125/1563 [12:21<05:27, 1.34it/s] 72%|███████▏ | 1126/1563 [12:22<04:53, 1.49it/s] 72%|███████▏ | 1127/1563 [12:23<04:40, 1.56it/s] 72%|███████▏ | 1128/1563 [12:23<04:53, 1.48it/s] 72%|███████▏ | 1129/1563 [12:24<04:26, 1.63it/s] 72%|███████▏ | 1130/1563 [12:25<04:54, 1.47it/s] 72%|███████▏ | 1131/1563 [12:25<05:15, 1.37it/s] 72%|███████▏ | 1132/1563 [12:26<04:35, 1.57it/s] 72%|███████▏ | 1133/1563 [12:27<04:38, 1.55it/s] 73%|███████▎ | 1134/1563 [12:27<04:20, 1.65it/s] 73%|███████▎ | 1135/1563 [12:28<04:41, 1.52it/s] 73%|███████▎ | 1136/1563 [12:29<04:55, 1.45it/s] 73%|███████▎ | 1137/1563 [12:29<05:16, 1.35it/s] 73%|███████▎ | 1138/1563 [12:30<04:39, 1.52it/s] 73%|███████▎ | 1139/1563 [12:31<04:37, 1.53it/s] 73%|███████▎ | 1140/1563 [12:31<04:26, 1.59it/s] 73%|███████▎ | 1141/1563 [12:32<04:41, 1.50it/s] 73%|███████▎ | 1142/1563 [12:33<05:01, 1.40it/s] 73%|███████▎ | 1143/1563 [12:33<05:10, 1.35it/s] 73%|███████▎ | 1144/1563 
[12:34<05:23, 1.30it/s] 73%|███████▎ | 1145/1563 [12:35<05:30, 1.26it/s] 73%|███████▎ | 1146/1563 [12:36<05:33, 1.25it/s] 73%|███████▎ | 1147/1563 [12:37<05:38, 1.23it/s] 73%|███████▎ | 1148/1563 [12:38<05:31, 1.25it/s] 74%|███████▎ | 1149/1563 [12:38<04:52, 1.41it/s] 74%|███████▎ | 1150/1563 [12:39<05:05, 1.35it/s] {'loss': 0.1443, 'grad_norm': 3.234375, 'learning_rate': 5.297504798464492e-06, 'epoch': 0.74} + 74%|███████▎ | 1150/1563 [12:39<05:05, 1.35it/s] 74%|███████▎ | 1151/1563 [12:40<05:08, 1.34it/s] 74%|███████▎ | 1152/1563 [12:41<05:18, 1.29it/s] 74%|███████▍ | 1153/1563 [12:41<05:09, 1.32it/s] 74%|███████▍ | 1154/1563 [12:42<04:29, 1.52it/s] 74%|███████▍ | 1155/1563 [12:42<04:07, 1.65it/s] 74%|███████▍ | 1156/1563 [12:43<04:06, 1.65it/s] 74%|███████▍ | 1157/1563 [12:43<03:56, 1.72it/s] 74%|███████▍ | 1158/1563 [12:44<03:40, 1.84it/s] 74%|███████▍ | 1159/1563 [12:44<03:53, 1.73it/s] 74%|███████▍ | 1160/1563 [12:45<04:07, 1.63it/s] 74%|███████▍ | 1161/1563 [12:46<04:17, 1.56it/s] 74%|███████▍ | 1162/1563 [12:47<04:32, 1.47it/s] 74%|███████▍ | 1163/1563 [12:47<04:15, 1.56it/s] 74%|███████▍ | 1164/1563 [12:48<04:09, 1.60it/s] 75%|███████▍ | 1165/1563 [12:48<03:56, 1.68it/s] 75%|███████▍ | 1166/1563 [12:49<04:27, 1.49it/s] 75%|███████▍ | 1167/1563 [12:50<04:48, 1.37it/s] 75%|███████▍ | 1168/1563 [12:50<04:15, 1.54it/s] 75%|███████▍ | 1169/1563 [12:51<04:28, 1.46it/s] 75%|███████▍ | 1170/1563 [12:52<04:24, 1.49it/s] 75%|███████▍ | 1171/1563 [12:53<04:37, 1.41it/s] 75%|███████▍ | 1172/1563 [12:53<04:11, 1.55it/s] 75%|███████▌ | 1173/1563 [12:54<04:31, 1.44it/s] 75%|███████▌ | 1174/1563 [12:55<04:50, 1.34it/s] 75%|███████▌ | 1175/1563 [12:55<04:41, 1.38it/s] 75%|███████▌ | 1176/1563 [12:56<04:54, 1.31it/s] 75%|███████▌ | 1177/1563 [12:57<04:43, 1.36it/s] 75%|███████▌ | 1178/1563 [12:57<04:11, 1.53it/s] 75%|███████▌ | 1179/1563 [12:58<04:27, 1.44it/s] 75%|███████▌ | 1180/1563 [12:59<04:36, 1.38it/s] 76%|███████▌ | 1181/1563 [13:00<04:09, 1.53it/s] 76%|███████▌ | 
1182/1563 [13:00<03:47, 1.67it/s] 76%|███████▌ | 1183/1563 [13:01<04:03, 1.56it/s] 76%|███████▌ | 1184/1563 [13:01<03:44, 1.69it/s] 76%|███████▌ | 1185/1563 [13:02<03:31, 1.79it/s] 76%|███████▌ | 1186/1563 [13:02<03:17, 1.91it/s] 76%|███████▌ | 1187/1563 [13:03<03:45, 1.67it/s] 76%|███████▌ | 1188/1563 [13:04<04:13, 1.48it/s] 76%|███████▌ | 1189/1563 [13:05<04:35, 1.36it/s] 76%|███████▌ | 1190/1563 [13:05<04:28, 1.39it/s] 76%|███████▌ | 1191/1563 [13:06<03:59, 1.56it/s] 76%|███████▋ | 1192/1563 [13:06<04:03, 1.53it/s] 76%|███████▋ | 1193/1563 [13:07<04:25, 1.39it/s] 76%|███████▋ | 1194/1563 [13:08<04:10, 1.47it/s] 76%|███████▋ | 1195/1563 [13:09<04:30, 1.36it/s] 77%|███████▋ | 1196/1563 [13:10<04:43, 1.29it/s] 77%|███████▋ | 1197/1563 [13:10<04:10, 1.46it/s] 77%|███████▋ | 1198/1563 [13:11<03:44, 1.63it/s] 77%|███████▋ | 1199/1563 [13:11<03:30, 1.73it/s] 77%|███████▋ | 1200/1563 [13:12<03:26, 1.75it/s] {'loss': 0.1396, 'grad_norm': 1.1015625, 'learning_rate': 4.657709532949457e-06, 'epoch': 0.77} + 77%|███████▋ | 1200/1563 [13:12<03:26, 1.75it/s] 77%|███████▋ | 1201/1563 [13:12<03:54, 1.54it/s] 77%|███████▋ | 1202/1563 [13:13<04:04, 1.48it/s] 77%|███████▋ | 1203/1563 [13:14<03:56, 1.52it/s] 77%|███████▋ | 1204/1563 [13:14<03:30, 1.70it/s] 77%|███████▋ | 1205/1563 [13:15<03:59, 1.50it/s] 77%|███████▋ | 1206/1563 [13:16<03:32, 1.68it/s] 77%|███████▋ | 1207/1563 [13:16<03:41, 1.60it/s] 77%|███████▋ | 1208/1563 [13:17<03:47, 1.56it/s] 77%|███████▋ | 1209/1563 [13:18<03:51, 1.53it/s] 77%|███████▋ | 1210/1563 [13:18<03:36, 1.63it/s] 77%|███████▋ | 1211/1563 [13:19<04:00, 1.46it/s] 78%|███████▊ | 1212/1563 [13:19<03:37, 1.61it/s] 78%|███████▊ | 1213/1563 [13:20<04:02, 1.45it/s] 78%|███████▊ | 1214/1563 [13:21<03:40, 1.58it/s] 78%|███████▊ | 1215/1563 [13:21<03:50, 1.51it/s] 78%|███████▊ | 1216/1563 [13:22<04:08, 1.40it/s] 78%|███████▊ | 1217/1563 [13:23<03:40, 1.57it/s] 78%|███████▊ | 1218/1563 [13:23<03:45, 1.53it/s] 78%|███████▊ | 1219/1563 [13:24<04:03, 1.41it/s] 
78%|███████▊ | 1220/1563 [13:25<03:46, 1.51it/s] 78%|███████▊ | 1221/1563 [13:26<03:51, 1.48it/s] 78%|███████▊ | 1222/1563 [13:26<03:31, 1.61it/s] 78%|███████▊ | 1223/1563 [13:27<03:51, 1.47it/s] 78%|███████▊ | 1224/1563 [13:28<04:00, 1.41it/s] 78%|███████▊ | 1225/1563 [13:28<03:48, 1.48it/s] 78%|███████▊ | 1226/1563 [13:29<03:28, 1.62it/s] 79%|███████▊ | 1227/1563 [13:29<03:38, 1.53it/s] 79%|███████▊ | 1228/1563 [13:30<03:14, 1.73it/s] 79%|███████▊ | 1229/1563 [13:31<03:28, 1.60it/s] 79%|███████▊ | 1230/1563 [13:31<03:25, 1.62it/s] 79%|███████▉ | 1231/1563 [13:32<03:48, 1.45it/s] 79%|███████▉ | 1232/1563 [13:33<03:54, 1.41it/s] 79%|███████▉ | 1233/1563 [13:33<03:35, 1.53it/s] 79%|███████▉ | 1234/1563 [13:34<03:17, 1.67it/s] 79%|███████▉ | 1235/1563 [13:34<02:56, 1.86it/s] 79%|███████▉ | 1236/1563 [13:35<03:18, 1.65it/s] 79%|███████▉ | 1237/1563 [13:35<03:01, 1.80it/s] 79%|███████▉ | 1238/1563 [13:36<03:26, 1.57it/s] 79%|███████▉ | 1239/1563 [13:37<03:32, 1.52it/s] 79%|███████▉ | 1240/1563 [13:37<03:14, 1.66it/s] 79%|███████▉ | 1241/1563 [13:38<03:37, 1.48it/s] 79%|███████▉ | 1242/1563 [13:39<03:12, 1.67it/s] 80%|███████▉ | 1243/1563 [13:39<03:19, 1.61it/s] 80%|███████▉ | 1244/1563 [13:40<03:25, 1.55it/s] 80%|███████▉ | 1245/1563 [13:41<03:17, 1.61it/s] 80%|███████▉ | 1246/1563 [13:41<02:59, 1.77it/s] 80%|███████▉ | 1247/1563 [13:42<03:18, 1.59it/s] 80%|███████▉ | 1248/1563 [13:43<03:31, 1.49it/s] 80%|███████▉ | 1249/1563 [13:43<03:37, 1.45it/s] 80%|███████▉ | 1250/1563 [13:44<03:39, 1.43it/s] {'loss': 0.1414, 'grad_norm': 0.8828125, 'learning_rate': 4.0179142674344215e-06, 'epoch': 0.8} + 80%|███████▉ | 1250/1563 [13:44<03:39, 1.43it/s] 80%|████████ | 1251/1563 [13:45<03:53, 1.34it/s] 80%|████████ | 1252/1563 [13:46<04:01, 1.29it/s] 80%|████████ | 1253/1563 [13:46<03:43, 1.38it/s] 80%|████████ | 1254/1563 [13:47<03:20, 1.54it/s] 80%|████████ | 1255/1563 [13:47<03:09, 1.62it/s] 80%|████████ | 1256/1563 [13:48<03:22, 1.51it/s] 80%|████████ | 1257/1563 [13:49<03:17, 
1.55it/s] 80%|████████ | 1258/1563 [13:49<03:18, 1.53it/s] 81%|████████ | 1259/1563 [13:50<03:05, 1.64it/s] 81%|████████ | 1260/1563 [13:51<03:22, 1.49it/s] 81%|████████ | 1261/1563 [13:51<03:22, 1.49it/s] 81%|████████ | 1262/1563 [13:52<03:11, 1.58it/s] 81%|████████ | 1263/1563 [13:52<02:54, 1.72it/s] 81%|████████ | 1264/1563 [13:53<03:00, 1.65it/s] 81%|████████ | 1265/1563 [13:54<03:24, 1.46it/s] 81%|████████ | 1266/1563 [13:55<03:25, 1.44it/s] 81%|████████ | 1267/1563 [13:55<03:22, 1.46it/s] 81%|████████ | 1268/1563 [13:56<03:20, 1.47it/s] 81%|████████ | 1269/1563 [13:57<03:05, 1.58it/s] 81%|████████▏ | 1270/1563 [13:57<03:22, 1.45it/s] 81%|████████▏ | 1271/1563 [13:58<03:20, 1.46it/s] 81%|████████▏ | 1272/1563 [13:59<03:19, 1.46it/s] 81%|████████▏ | 1273/1563 [13:59<03:15, 1.48it/s] 82%|████████▏ | 1274/1563 [14:00<03:31, 1.37it/s] 82%|████████▏ | 1275/1563 [14:01<03:25, 1.40it/s] 82%|████████▏ | 1276/1563 [14:02<03:38, 1.31it/s] 82%|████████▏ | 1277/1563 [14:03<03:45, 1.27it/s] 82%|████████▏ | 1278/1563 [14:03<03:16, 1.45it/s] 82%|████████▏ | 1279/1563 [14:04<02:56, 1.61it/s] 82%|████████▏ | 1280/1563 [14:04<03:15, 1.45it/s] 82%|████████▏ | 1281/1563 [14:05<02:53, 1.63it/s] 82%|████████▏ | 1282/1563 [14:06<03:11, 1.46it/s] 82%|████████▏ | 1283/1563 [14:06<02:57, 1.58it/s] 82%|████████▏ | 1284/1563 [14:07<03:11, 1.46it/s] 82%|████████▏ | 1285/1563 [14:07<02:55, 1.58it/s] 82%|████████▏ | 1286/1563 [14:08<02:39, 1.74it/s] 82%|████████▏ | 1287/1563 [14:09<02:54, 1.58it/s] 82%|████████▏ | 1288/1563 [14:09<02:52, 1.59it/s] 82%|████████▏ | 1289/1563 [14:10<02:36, 1.75it/s] 83%|████████▎ | 1290/1563 [14:10<02:27, 1.85it/s] 83%|████████▎ | 1291/1563 [14:11<02:44, 1.65it/s] 83%|████████▎ | 1292/1563 [14:12<02:51, 1.58it/s] 83%|████████▎ | 1293/1563 [14:12<02:33, 1.76it/s] 83%|████████▎ | 1294/1563 [14:13<02:54, 1.54it/s] 83%|████████▎ | 1295/1563 [14:14<03:10, 1.41it/s] 83%|████████▎ | 1296/1563 [14:15<03:22, 1.32it/s] 83%|████████▎ | 1297/1563 [14:15<03:04, 1.44it/s]
83%|████████▎ | 1298/1563 [14:16<03:06, 1.42it/s] 83%|████████▎ | 1299/1563 [14:16<02:48, 1.56it/s] 83%|████████▎ | 1300/1563 [14:17<02:33, 1.71it/s] {'loss': 0.1409, 'grad_norm': 8.8125, 'learning_rate': 3.378119001919386e-06, 'epoch': 0.83} + 83%|████████▎ | 1300/1563 [14:17<02:33, 1.71it/s] 83%|████████▎ | 1301/1563 [14:18<02:44, 1.60it/s] 83%|████████▎ | 1302/1563 [14:18<02:26, 1.78it/s] 83%|████████▎ | 1303/1563 [14:18<02:18, 1.88it/s] 83%|████████▎ | 1304/1563 [14:19<02:38, 1.64it/s] 83%|████████▎ | 1305/1563 [14:20<02:34, 1.67it/s] 84%|████████▎ | 1306/1563 [14:20<02:23, 1.80it/s] 84%|████████▎ | 1307/1563 [14:21<02:40, 1.60it/s] 84%|████████▎ | 1308/1563 [14:22<02:54, 1.46it/s] 84%|████████▎ | 1309/1563 [14:22<02:37, 1.61it/s] 84%|████████▍ | 1310/1563 [14:23<02:43, 1.55it/s] 84%|████████▍ | 1311/1563 [14:24<02:33, 1.64it/s] 84%|████████▍ | 1312/1563 [14:24<02:24, 1.73it/s] 84%|████████▍ | 1313/1563 [14:25<02:31, 1.65it/s] 84%|████████▍ | 1314/1563 [14:25<02:20, 1.77it/s] 84%|████████▍ | 1315/1563 [14:26<02:19, 1.78it/s] 84%|████████▍ | 1316/1563 [14:27<02:38, 1.55it/s] 84%|████████▍ | 1317/1563 [14:27<02:30, 1.64it/s] 84%|████████▍ | 1318/1563 [14:28<02:19, 1.76it/s] 84%|████████▍ | 1319/1563 [14:28<02:09, 1.88it/s] 84%|████████▍ | 1320/1563 [14:29<02:28, 1.63it/s] 85%|████████▍ | 1321/1563 [14:29<02:17, 1.76it/s] 85%|████████▍ | 1322/1563 [14:30<02:22, 1.70it/s] 85%|████████▍ | 1323/1563 [14:31<02:18, 1.73it/s] 85%|████████▍ | 1324/1563 [14:31<02:22, 1.68it/s] 85%|████████▍ | 1325/1563 [14:32<02:21, 1.68it/s] 85%|████████▍ | 1326/1563 [14:33<02:39, 1.49it/s] 85%|████████▍ | 1327/1563 [14:33<02:44, 1.44it/s] 85%|████████▍ | 1328/1563 [14:34<02:25, 1.62it/s] 85%|████████▌ | 1329/1563 [14:34<02:24, 1.62it/s] 85%|████████▌ | 1330/1563 [14:35<02:16, 1.70it/s] 85%|████████▌ | 1331/1563 [14:36<02:33, 1.51it/s] 85%|████████▌ | 1332/1563 [14:36<02:35, 1.48it/s] 85%|████████▌ | 1333/1563 [14:37<02:17, 1.67it/s] 85%|████████▌ | 1334/1563 [14:37<02:14, 1.71it/s] 
85%|████████▌ | 1335/1563 [14:38<02:02, 1.85it/s] 85%|████████▌ | 1336/1563 [14:39<02:22, 1.59it/s] 86%|████████▌ | 1337/1563 [14:40<02:34, 1.46it/s] 86%|████████▌ | 1338/1563 [14:40<02:45, 1.36it/s] 86%|████████▌ | 1339/1563 [14:41<02:49, 1.32it/s] 86%|████████▌ | 1340/1563 [14:42<02:55, 1.27it/s] 86%|████████▌ | 1341/1563 [14:43<02:47, 1.32it/s] 86%|████████▌ | 1342/1563 [14:43<02:44, 1.34it/s] 86%|████████▌ | 1343/1563 [14:44<02:39, 1.38it/s] 86%|████████▌ | 1344/1563 [14:45<02:46, 1.32it/s] 86%|████████▌ | 1345/1563 [14:46<02:51, 1.27it/s] 86%|████████▌ | 1346/1563 [14:46<02:36, 1.39it/s] 86%|████████▌ | 1347/1563 [14:47<02:44, 1.31it/s] 86%|████████▌ | 1348/1563 [14:48<02:29, 1.44it/s] 86%|████████▋ | 1349/1563 [14:48<02:19, 1.53it/s] 86%|████████▋ | 1350/1563 [14:49<02:11, 1.62it/s] {'loss': 0.1395, 'grad_norm': 12.1875, 'learning_rate': 2.738323736404351e-06, 'epoch': 0.86} + 86%|████████▋ | 1350/1563 [14:49<02:11, 1.62it/s] 86%|████████▋ | 1351/1563 [14:49<02:09, 1.63it/s] 87%|████████▋ | 1352/1563 [14:50<02:25, 1.45it/s] 87%|████████▋ | 1353/1563 [14:51<02:32, 1.37it/s] 87%|████████▋ | 1354/1563 [14:52<02:19, 1.50it/s] 87%|████████▋ | 1355/1563 [14:52<02:24, 1.44it/s] 87%|████████▋ | 1356/1563 [14:53<02:34, 1.34it/s] 87%|████████▋ | 1357/1563 [14:54<02:13, 1.54it/s] 87%|████████▋ | 1358/1563 [14:54<02:18, 1.49it/s] 87%|████████▋ | 1359/1563 [14:55<02:28, 1.37it/s] 87%|████████▋ | 1360/1563 [14:56<02:25, 1.40it/s] 87%|████████▋ | 1361/1563 [14:57<02:27, 1.37it/s] 87%|████████▋ | 1362/1563 [14:57<02:25, 1.38it/s] 87%|████████▋ | 1363/1563 [14:58<02:31, 1.32it/s] 87%|████████▋ | 1364/1563 [14:59<02:32, 1.30it/s] 87%|████████▋ | 1365/1563 [15:00<02:19, 1.42it/s] 87%|████████▋ | 1366/1563 [15:01<02:26, 1.34it/s] 87%|████████▋ | 1367/1563 [15:01<02:14, 1.46it/s] 88%|████████▊ | 1368/1563 [15:02<02:23, 1.36it/s] 88%|████████▊ | 1369/1563 [15:03<02:30, 1.29it/s] 88%|████████▊ | 1370/1563 [15:03<02:20, 1.37it/s] 88%|████████▊ | 1371/1563 [15:04<02:04, 1.54it/s] 
88%|████████▊ | 1372/1563 [15:05<02:06, 1.51it/s] 88%|████████▊ | 1373/1563 [15:05<02:14, 1.42it/s] 88%|████████▊ | 1374/1563 [15:06<02:00, 1.57it/s] 88%|████████▊ | 1375/1563 [15:06<01:49, 1.72it/s] 88%|████████▊ | 1376/1563 [15:07<01:47, 1.75it/s] 88%|████████▊ | 1377/1563 [15:07<01:48, 1.72it/s] 88%|████████▊ | 1378/1563 [15:08<01:38, 1.88it/s] 88%|████████▊ | 1379/1563 [15:08<01:38, 1.87it/s] 88%|████████▊ | 1380/1563 [15:09<01:34, 1.94it/s] 88%|████████▊ | 1381/1563 [15:09<01:29, 2.04it/s] 88%|████████▊ | 1382/1563 [15:10<01:27, 2.07it/s] 88%|████████▊ | 1383/1563 [15:10<01:32, 1.95it/s] 89%|████████▊ | 1384/1563 [15:11<01:28, 2.03it/s] 89%|████████▊ | 1385/1563 [15:11<01:29, 1.99it/s] 89%|████████▊ | 1386/1563 [15:12<01:44, 1.70it/s] 89%|████████▊ | 1387/1563 [15:13<01:36, 1.83it/s] 89%|████████▉ | 1388/1563 [15:13<01:50, 1.58it/s] 89%|████████▉ | 1389/1563 [15:14<01:42, 1.70it/s] 89%|████████▉ | 1390/1563 [15:14<01:38, 1.75it/s] 89%|████████▉ | 1391/1563 [15:15<01:34, 1.82it/s] 89%|████████▉ | 1392/1563 [15:15<01:31, 1.86it/s] 89%|████████▉ | 1393/1563 [15:16<01:48, 1.57it/s] 89%|████████▉ | 1394/1563 [15:17<01:51, 1.52it/s] 89%|████████▉ | 1395/1563 [15:18<01:53, 1.48it/s] 89%|████████▉ | 1396/1563 [15:19<02:01, 1.37it/s] 89%|████████▉ | 1397/1563 [15:19<02:02, 1.35it/s] 89%|████████▉ | 1398/1563 [15:20<02:08, 1.28it/s] 90%|████████▉ | 1399/1563 [15:21<01:52, 1.46it/s] 90%|████████▉ | 1400/1563 [15:21<01:40, 1.62it/s] {'loss': 0.1385, 'grad_norm': 0.73828125, 'learning_rate': 2.0985284708893156e-06, 'epoch': 0.9} + 90%|████████▉ | 1400/1563 [15:21<01:40, 1.62it/s] 90%|████████▉ | 1401/1563 [15:22<01:33, 1.73it/s] 90%|████████▉ | 1402/1563 [15:22<01:26, 1.86it/s] 90%|████████▉ | 1403/1563 [15:23<01:22, 1.94it/s] 90%|████████▉ | 1404/1563 [15:23<01:26, 1.85it/s] 90%|████████▉ | 1405/1563 [15:24<01:40, 1.58it/s] 90%|████████▉ | 1406/1563 [15:25<01:49, 1.43it/s] 90%|█████████ | 1407/1563 [15:26<01:48, 1.44it/s] 90%|█████████ | 1408/1563 [15:26<01:36, 1.60it/s] 
90%|█████████ | 1409/1563 [15:26<01:30, 1.70it/s] 90%|█████████ | 1410/1563 [15:27<01:42, 1.49it/s] 90%|█████████ | 1411/1563 [15:28<01:42, 1.49it/s] 90%|█████████ | 1412/1563 [15:28<01:30, 1.68it/s] 90%|█████████ | 1413/1563 [15:29<01:22, 1.82it/s] 90%|█████████ | 1414/1563 [15:29<01:18, 1.91it/s] 91%|█████████ | 1415/1563 [15:30<01:25, 1.72it/s] 91%|█████████ | 1416/1563 [15:31<01:37, 1.51it/s] 91%|█████████ | 1417/1563 [15:32<01:36, 1.52it/s] 91%|█████████ | 1418/1563 [15:32<01:35, 1.52it/s] 91%|█████████ | 1419/1563 [15:33<01:41, 1.42it/s] 91%|█████████ | 1420/1563 [15:34<01:40, 1.42it/s] 91%|█████████ | 1421/1563 [15:34<01:35, 1.49it/s] 91%|█████████ | 1422/1563 [15:35<01:33, 1.51it/s] 91%|█████████ | 1423/1563 [15:36<01:35, 1.47it/s] 91%|█████████ | 1424/1563 [15:36<01:30, 1.53it/s] 91%|█████████ | 1425/1563 [15:37<01:29, 1.55it/s] 91%|█████████ | 1426/1563 [15:37<01:23, 1.63it/s] 91%|█████████▏| 1427/1563 [15:38<01:32, 1.47it/s] 91%|█████████▏| 1428/1563 [15:39<01:30, 1.48it/s] 91%|█████████▏| 1429/1563 [15:40<01:35, 1.40it/s] 91%|█████████▏| 1430/1563 [15:40<01:31, 1.45it/s] 92%|█████████▏| 1431/1563 [15:41<01:28, 1.49it/s] 92%|█████████▏| 1432/1563 [15:42<01:22, 1.58it/s] 92%|█████████▏| 1433/1563 [15:42<01:26, 1.50it/s] 92%|█████████▏| 1434/1563 [15:43<01:33, 1.39it/s] 92%|█████████▏| 1435/1563 [15:44<01:38, 1.30it/s] 92%|█████████▏| 1436/1563 [15:45<01:39, 1.27it/s] 92%|█████████▏| 1437/1563 [15:46<01:42, 1.23it/s] 92%|█████████▏| 1438/1563 [15:47<01:41, 1.23it/s] 92%|█████████▏| 1439/1563 [15:47<01:35, 1.30it/s] 92%|█████████▏| 1440/1563 [15:48<01:35, 1.29it/s] 92%|█████████▏| 1441/1563 [15:49<01:37, 1.25it/s] 92%|█████████▏| 1442/1563 [15:49<01:25, 1.41it/s] 92%|█████████▏| 1443/1563 [15:50<01:30, 1.33it/s] 92%|█████████▏| 1444/1563 [15:51<01:21, 1.46it/s] 92%|█████████▏| 1445/1563 [15:51<01:16, 1.55it/s] 93%|█████████▎| 1446/1563 [15:52<01:14, 1.56it/s] 93%|█████████▎| 1447/1563 [15:52<01:12, 1.59it/s] 93%|█████████▎| 1448/1563 [15:53<01:12,
1.59it/s] 93%|█████████▎| 1449/1563 [15:54<01:06, 1.72it/s] 93%|█████████▎| 1450/1563 [15:54<01:02, 1.80it/s] {'loss': 0.1388, 'grad_norm': 14.0625, 'learning_rate': 1.4587332053742803e-06, 'epoch': 0.93} + 93%|█████████▎| 1450/1563 [15:54<01:02, 1.80it/s] 93%|█████████▎| 1451/1563 [15:55<01:05, 1.70it/s] 93%|█████████▎| 1452/1563 [15:55<01:01, 1.79it/s] 93%|█████████▎| 1453/1563 [15:56<01:07, 1.63it/s] 93%|█████████▎| 1454/1563 [15:56<01:01, 1.76it/s] 93%|█████████▎| 1455/1563 [15:57<01:04, 1.67it/s] 93%|█████████▎| 1456/1563 [15:58<01:03, 1.69it/s] 93%|█████████▎| 1457/1563 [15:58<01:05, 1.62it/s] 93%|█████████▎| 1458/1563 [15:59<00:59, 1.77it/s] 93%|█████████▎| 1459/1563 [15:59<00:54, 1.92it/s] 93%|█████████▎| 1460/1563 [16:00<00:58, 1.76it/s] 93%|█████████▎| 1461/1563 [16:01<01:04, 1.57it/s] 94%|█████████▎| 1462/1563 [16:01<01:07, 1.49it/s] 94%|█████████▎| 1463/1563 [16:02<01:12, 1.38it/s] 94%|█████████▎| 1464/1563 [16:03<01:04, 1.54it/s] 94%|█████████▎| 1465/1563 [16:03<01:03, 1.55it/s] 94%|█████████▍| 1466/1563 [16:04<01:09, 1.40it/s] 94%|█████████▍| 1467/1563 [16:05<01:09, 1.38it/s] 94%|█████████▍| 1468/1563 [16:05<01:00, 1.56it/s] 94%|█████████▍| 1469/1563 [16:06<00:54, 1.72it/s] 94%|█████████▍| 1470/1563 [16:07<01:01, 1.52it/s] 94%|█████████▍| 1471/1563 [16:07<01:00, 1.53it/s] 94%|█████████▍| 1472/1563 [16:08<00:55, 1.64it/s] 94%|█████████▍| 1473/1563 [16:08<00:52, 1.73it/s] 94%|█████████▍| 1474/1563 [16:09<00:57, 1.54it/s] 94%|█████████▍| 1475/1563 [16:10<00:52, 1.69it/s] 94%|█████████▍| 1476/1563 [16:10<00:56, 1.53it/s] 94%|█████████▍| 1477/1563 [16:11<00:53, 1.61it/s] 95%|█████████▍| 1478/1563 [16:12<00:50, 1.68it/s] 95%|█████████▍| 1479/1563 [16:12<00:47, 1.76it/s] 95%|█████████▍| 1480/1563 [16:13<00:50, 1.64it/s] 95%|█████████▍| 1481/1563 [16:14<00:54, 1.50it/s] 95%|█████████▍| 1482/1563 [16:14<00:57, 1.40it/s] 95%|█████████▍| 1483/1563 [16:15<00:56, 1.42it/s] 95%|█████████▍| 1484/1563 [16:16<00:55, 1.43it/s] 95%|█████████▌| 1485/1563 [16:17<00:58, 
1.33it/s] 95%|█████████▌| 1486/1563 [16:17<00:59, 1.29it/s] 95%|█████████▌| 1487/1563 [16:18<00:58, 1.29it/s] 95%|█████████▌| 1488/1563 [16:19<00:56, 1.33it/s] 95%|█████████▌| 1489/1563 [16:20<00:54, 1.36it/s] 95%|█████████▌| 1490/1563 [16:20<00:55, 1.31it/s] 95%|█████████▌| 1491/1563 [16:21<00:48, 1.49it/s] 95%|█████████▌| 1492/1563 [16:21<00:44, 1.59it/s] 96%|█████████▌| 1493/1563 [16:22<00:41, 1.69it/s] 96%|█████████▌| 1494/1563 [16:23<00:39, 1.74it/s] 96%|█████████▌| 1495/1563 [16:23<00:39, 1.73it/s] 96%|█████████▌| 1496/1563 [16:24<00:41, 1.60it/s] 96%|█████████▌| 1497/1563 [16:25<00:43, 1.53it/s] 96%|█████████▌| 1498/1563 [16:25<00:44, 1.46it/s] 96%|█████████▌| 1499/1563 [16:26<00:39, 1.60it/s] 96%|█████████▌| 1500/1563 [16:26<00:40, 1.57it/s] {'loss': 0.1387, 'grad_norm': 1.71875, 'learning_rate': 8.18937939859245e-07, 'epoch': 0.96} + 96%|█████████▌| 1500/1563 [16:27<00:40, 1.57it/s] 96%|█████████▌| 1501/1563 [16:27<00:43, 1.41it/s] 96%|█████████▌| 1502/1563 [16:28<00:39, 1.54it/s] 96%|█████████▌| 1503/1563 [16:29<00:40, 1.49it/s] 96%|█████████▌| 1504/1563 [16:29<00:40, 1.44it/s] 96%|█████████▋| 1505/1563 [16:30<00:39, 1.47it/s] 96%|█████████▋| 1506/1563 [16:31<00:41, 1.39it/s] 96%|█████████▋| 1507/1563 [16:31<00:36, 1.54it/s] 96%|█████████▋| 1508/1563 [16:32<00:39, 1.40it/s] 97%|█████████▋| 1509/1563 [16:33<00:40, 1.33it/s] 97%|█████████▋| 1510/1563 [16:33<00:35, 1.50it/s] 97%|█████████▋| 1511/1563 [16:34<00:36, 1.41it/s] 97%|█████████▋| 1512/1563 [16:35<00:33, 1.53it/s] 97%|█████████▋| 1513/1563 [16:35<00:29, 1.68it/s] 97%|█████████▋| 1514/1563 [16:36<00:27, 1.77it/s] 97%|█████████▋| 1515/1563 [16:36<00:28, 1.69it/s] 97%|█████████▋| 1516/1563 [16:37<00:26, 1.81it/s] 97%|█████████▋| 1517/1563 [16:38<00:29, 1.56it/s] 97%|█████████▋| 1518/1563 [16:39<00:31, 1.43it/s] 97%|█████████▋| 1519/1563 [16:39<00:27, 1.61it/s] 97%|█████████▋| 1520/1563 [16:40<00:29, 1.44it/s] 97%|█████████▋| 1521/1563 [16:40<00:26, 1.61it/s] 97%|█████████▋| 1522/1563 [16:41<00:28, 
1.44it/s] 97%|█████████▋| 1523/1563 [16:42<00:28, 1.40it/s] 98%|█████████▊| 1524/1563 [16:43<00:29, 1.33it/s] 98%|█████████▊| 1525/1563 [16:43<00:24, 1.52it/s] 98%|█████████▊| 1526/1563 [16:44<00:21, 1.70it/s] 98%|█████████▊| 1527/1563 [16:44<00:21, 1.71it/s] 98%|█████████▊| 1528/1563 [16:45<00:22, 1.56it/s] 98%|█████████▊| 1529/1563 [16:45<00:20, 1.68it/s] 98%|█████████▊| 1530/1563 [16:46<00:19, 1.67it/s] 98%|█████████▊| 1531/1563 [16:47<00:18, 1.77it/s] 98%|█████████▊| 1532/1563 [16:47<00:18, 1.68it/s] 98%|█████████▊| 1533/1563 [16:48<00:18, 1.58it/s] 98%|█████████▊| 1534/1563 [16:49<00:20, 1.44it/s] 98%|█████████▊| 1535/1563 [16:49<00:17, 1.59it/s] 98%|█████████▊| 1536/1563 [16:50<00:18, 1.48it/s] 98%|█████████▊| 1537/1563 [16:51<00:20, 1.30it/s] 98%|█████████▊| 1538/1563 [16:52<00:19, 1.31it/s] 98%|█████████▊| 1539/1563 [16:52<00:17, 1.40it/s] 99%|█████████▊| 1540/1563 [16:53<00:15, 1.44it/s] 99%|█████████▊| 1541/1563 [16:54<00:14, 1.49it/s] 99%|█████████▊| 1542/1563 [16:54<00:12, 1.65it/s] 99%|█████████▊| 1543/1563 [16:55<00:11, 1.75it/s] 99%|█████████▉| 1544/1563 [16:55<00:10, 1.88it/s] 99%|█████████▉| 1545/1563 [16:56<00:10, 1.72it/s] 99%|█████████▉| 1546/1563 [16:56<00:10, 1.57it/s] 99%|█████████▉| 1547/1563 [16:57<00:10, 1.55it/s] 99%|█████████▉| 1548/1563 [16:58<00:08, 1.67it/s] 99%|█████████▉| 1549/1563 [16:58<00:07, 1.76it/s] 99%|█████████▉| 1550/1563 [16:59<00:08, 1.53it/s] {'loss': 0.1407, 'grad_norm': 2.4375, 'learning_rate': 1.7914267434420988e-07, 'epoch': 0.99} + 99%|█████████▉| 1550/1563 [16:59<00:08, 1.53it/s] 99%|█████████▉| 1551/1563 [16:59<00:07, 1.69it/s] 99%|█████████▉| 1552/1563 [17:00<00:07, 1.49it/s] 99%|█████████▉| 1553/1563 [17:01<00:06, 1.48it/s] 99%|█████████▉| 1554/1563 [17:02<00:06, 1.37it/s] 99%|█████████▉| 1555/1563 [17:02<00:05, 1.45it/s] 100%|█████████▉| 1556/1563 [17:03<00:04, 1.53it/s] 100%|█████████▉| 1557/1563 [17:03<00:03, 1.66it/s] 100%|█████████▉| 1558/1563 [17:04<00:03, 1.47it/s] 100%|█████████▉| 1559/1563 [17:05<00:02, 
1.47it/s] 100%|█████████▉| 1560/1563 [17:06<00:02, 1.37it/s] 100%|█████████▉| 1561/1563 [17:07<00:01, 1.36it/s] 100%|█████████▉| 1562/1563 [17:07<00:00, 1.50it/s] 100%|██████████| 1563/1563 [17:08<00:00, 1.51it/s] {'train_runtime': 1033.8148, 'train_samples_per_second': 193.458, 'train_steps_per_second': 1.512, 'train_loss': 0.2836286289449388, 'epoch': 1.0} + 100%|██████████| 1563/1563 [17:12<00:00, 1.51it/s] 100%|██████████| 1563/1563 [17:12<00:00, 1.51it/s] + model.safetensors: 0%| | 0.00/2.00G [00:00