[2025-05-14 20:58:45] Created output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
[2025-05-14 20:58:45] Chat mode disabled
[2025-05-14 20:58:45] Model size is 3B or smaller (1B). Using full fine-tuning.
[2025-05-14 20:58:45] No QA format data will be used
[2025-05-14 20:58:45] Limiting dataset size to: 100 samples
[2025-05-14 20:58:45] =======================================
[2025-05-14 20:58:45] Starting training for model: google/gemma-3-1b-pt
[2025-05-14 20:58:45] =======================================
[2025-05-14 20:58:45] CUDA_VISIBLE_DEVICES: 0,1,2,3
[2025-05-14 20:58:45] WANDB_PROJECT: wikidyk-ar
[2025-05-14 20:58:45] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2_trainqas.json
[2025-05-14 20:58:45] Global Batch Size: 512
[2025-05-14 20:58:45] Data Size: 100
[2025-05-14 20:58:45] Executing command: torchrun --nproc_per_node "4" --master-port 29581 src/train.py \
      --model_name_or_path "google/gemma-3-1b-pt" \
      --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2_trainqas.json" \
      --output_dir "train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000" \
      --num_upsample "1000" \
      --per_device_train_batch_size "128" \
      --gradient_accumulation_steps "1" \
      --learning_rate "2e-5" \
      --num_train_epochs "1" \
      --model_max_length "32768" \
      --report_to wandb --logging_steps 50 \
      --save_strategy steps --save_steps 10000 \
      --save_total_limit 3 \
      --resume_from_checkpoint True \
      --bf16 True --use_flash_attention_2 True \
      --qa_data_ratio "-1" \
      --predict_mask "false" \
      --ds_size 100
[2025-05-14 20:58:45] Training started at Wed May 14 20:58:45 UTC 2025
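A quick check of the logged global batch size against the launch flags above (a sketch; standard DDP accounting assumed):

    # Global batch = processes x per-device batch x gradient accumulation steps.
    nproc_per_node = 4       # --nproc_per_node "4"
    per_device_batch = 128   # --per_device_train_batch_size "128"
    grad_accum = 1           # --gradient_accumulation_steps "1"
    print(nproc_per_node * per_device_batch * grad_accum)  # 512, matching "Global Batch Size: 512"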
W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] 
W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] *****************************************
W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] *****************************************
WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
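The deprecation notice above points at `attn_implementation`; a minimal sketch of the non-deprecated load call (model name from the command above, dtype assumed to mirror --bf16 True):

    import torch
    from transformers import AutoModelForCausalLM

    # Replaces the deprecated use_flash_attention_2=True flag, per the warning above.
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-3-1b-pt",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    )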
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 100000 QA examples
WARNING:root:  - 100 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 200000
/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
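Per the FutureWarning, the same constructor call with the renamed keyword would read as follows (a sketch of the quoted train.py line only; surrounding names are taken from the warning, not defined here):

    # `tokenizer=` is deprecated in Trainer; `processing_class=` is the suggested replacement.
    trainer = Trainer(model=model, processing_class=tokenizer, args=training_args, **data_module)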
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 100000 QA examples
WARNING:root:  - 100 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 200000
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 100000 QA examples
WARNING:root:  - 100 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 200000
/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 100000 QA examples
WARNING:root:  - 100 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 200000
/root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
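The dataset counts above and the 391-step progress bar further down are consistent with the launch flags (a quick check):

    import math

    qa_examples = 100_000   # "100000 QA examples"
    fact_examples = 100     # --ds_size 100
    upsample = 1_000        # --num_upsample "1000"
    total = qa_examples + fact_examples * upsample
    print(total)                    # 200000, matching "Total examples: 200000"
    print(math.ceil(total / 512))   # 391 optimizer steps for one epoch at global batch 512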
Checkpoint missing; starting training from scratch
Checkpoint missing; starting training from scratch
Checkpoint missing; starting training from scratch
Checkpoint missing; starting training from scratch
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
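Following the Gemma3 recommendation above verbatim, the load call would swap the attention backend string (a sketch):

    from transformers import AutoModelForCausalLM

    # Eager attention, as recommended for training Gemma3 models in the warning above.
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-3-1b-pt",
        attn_implementation="eager",
    )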
wandb: Currently logged in as: yuweiz to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.11
wandb: Run data is saved locally in /root/yuwei/WikiDYKEvalV2/wandb/run-20250514_205901-64zk7otl
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
wandb: ⭐️ View project at https://wandb.ai/yuweiz/wikidyk-ar
wandb: 🚀 View run at https://wandb.ai/yuweiz/wikidyk-ar/runs/64zk7otl

  0%|          | 0/391 [00:00<?, ?it/s]It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
[rank3]:[W514 20:59:03.405376210 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W514 20:59:03.408516336 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank2]:[W514 20:59:03.435576812 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W514 20:59:03.436675735 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

  0%|          | 1/391 [00:02<18:26,  2.84s/it][rank1]: Traceback (most recent call last):
[rank1]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
[rank1]:     trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
[rank1]:     raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
[rank1]: ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)

[rank1]: During handling of the above exception, another exception occurred:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
[rank1]:     train()
[rank1]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
[rank1]:     trainer.train()
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
[rank1]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
[rank1]:     outputs = model(**inputs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank1]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank1]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/accelerate/utils/operations.py", line 814, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/accelerate/utils/operations.py", line 802, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
[rank1]:     output = func(self, *args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 966, in forward
[rank1]:     loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 63, in ForCausalLMLoss
[rank1]:     loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 35, in fixed_cross_entropy
[rank1]:     loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
[rank1]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/functional.py", line 3494, in cross_entropy
[rank1]:     return torch._C._nn.cross_entropy_loss(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 1 has a total capacity of 79.25 GiB of which 5.79 GiB is free. Process 967603 has 33.68 GiB memory in use. Process 1012455 has 39.77 GiB memory in use. Of the allocated memory 33.76 GiB is allocated by PyTorch, and 4.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
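The chained traceback (ValueError on resume, then a fresh trainer.train()) matches the earlier "Checkpoint missing; starting training from scratch" lines; a minimal sketch of that fallback flow, with names assumed from the traceback:

    def train_with_optional_resume(trainer, resume_from_checkpoint=True):
        """Try to resume; fall back to a fresh run if no checkpoint exists (names assumed)."""
        try:
            trainer.train(resume_from_checkpoint=resume_from_checkpoint)
        except ValueError:
            # Raised as "No valid checkpoint found in output directory (...)".
            print("Checkpoint missing; starting training from scratch")
            trainer.train()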
[rank2]: Traceback (most recent call last):
[rank2]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
[rank2]:     trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
[rank2]:     raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
[rank2]: ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)

[rank2]: During handling of the above exception, another exception occurred:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
[rank2]:     train()
[rank2]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
[rank2]:     trainer.train()
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank2]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
[rank2]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
[rank2]:     outputs = model(**inputs)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1645, in forward
[rank2]:     return self._post_forward(output)
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1620, in _post_forward
[rank2]:     passthrough_tensor_list = _DDPSink.apply(
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank2]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 250, in forward
[rank2]:     ret = tuple(
[rank2]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 251, in <genexpr>
[rank2]:     inp.clone() if isinstance(inp, torch.Tensor) else inp for inp in inputs
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 2 has a total capacity of 79.25 GiB of which 965.44 MiB is free. Process 967604 has 33.77 GiB memory in use. Process 1012456 has 44.53 GiB memory in use. Of the allocated memory 36.89 GiB is allocated by PyTorch, and 5.95 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
    trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
    raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
    train()
  File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
    trainer.train()
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1645, in forward
    return self._post_forward(output)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1620, in _post_forward
    passthrough_tensor_list = _DDPSink.apply(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 250, in forward
    ret = tuple(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 251, in <genexpr>
    inp.clone() if isinstance(inp, torch.Tensor) else inp for inp in inputs
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 0 has a total capacity of 79.25 GiB of which 1.52 GiB is free. Process 967602 has 31.72 GiB memory in use. Process 1012454 has 45.99 GiB memory in use. Of the allocated memory 36.79 GiB is allocated by PyTorch, and 7.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
[rank0]:     trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
[rank0]:     raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
[rank0]: ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
[rank0]:     train()
[rank0]:   File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
[rank0]:     trainer.train()
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1645, in forward
[rank0]:     return self._post_forward(output)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1620, in _post_forward
[rank0]:     passthrough_tensor_list = _DDPSink.apply(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 250, in forward
[rank0]:     ret = tuple(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 251, in <genexpr>
[rank0]:     inp.clone() if isinstance(inp, torch.Tensor) else inp for inp in inputs
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 0 has a total capacity of 79.25 GiB of which 1.52 GiB is free. Process 967602 has 31.72 GiB memory in use. Process 1012454 has 45.99 GiB memory in use. Of the allocated memory 36.79 GiB is allocated by PyTorch, and 7.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
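The OOM messages themselves suggest one mitigation, and also show two other processes already holding roughly 30-46 GiB on each GPU; a sketch of the allocator hint plus an assumed batch-size trade-off (the latter is not from the log):

    import os

    # Allocator setting suggested in the OOM message; it must be set before CUDA is
    # initialized (in practice, exported in the launch script, not inside the run).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    # Alternative knob (assumption): keep the 512 global batch but shrink per-GPU memory,
    # e.g. --per_device_train_batch_size "32" --gradient_accumulation_steps "4" on 4 GPUs.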
wandb: 
wandb: 🚀 View run train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 at: https://wandb.ai/yuweiz/wikidyk-ar/runs/64zk7otl
wandb: Find logs at: wandb/run-20250514_205901-64zk7otl/logs
W0514 20:59:09.640000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 610214 closing signal SIGTERM
W0514 20:59:09.641000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 610215 closing signal SIGTERM
W0514 20:59:09.641000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 610217 closing signal SIGTERM
E0514 20:59:10.219000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 610216) of binary: /root/miniconda3/envs/wikidyk/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-14_20:59:09
  host      : bb9aa167977b
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 610216)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2025-05-14 20:59:10] ERROR: Training failed for google/gemma-3-1b-pt with exit code 1
[2025-05-14 20:59:10] Check error log for details: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000/20250514_205100.log
[2025-05-14 20:59:10] Resource usage after training google/gemma-3-1b-pt:
[2025-05-14 20:59:10] GPU memory usage:
32495 MiB, 81920 MiB
34501 MiB, 81920 MiB
34591 MiB, 81920 MiB
32659 MiB, 81920 MiB
[2025-05-14 20:59:10] Disk space usage for model outputs:
32K	train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
[2025-05-14 20:59:10] 
[2025-05-14 20:59:10] All training runs completed at Wed May 14 20:59:10 UTC 2025
[2025-05-14 20:59:10] =======================================
[2025-05-14 20:59:10] Summary of training runs:
[2025-05-14 20:59:10] Model | Status | Duration | Output Size