YWZBrandon committed
Commit 8c97474 · verified · 1 Parent(s): bd5d443

End of training

20250514_205100.log ADDED
@@ -0,0 +1,295 @@
1
+ [2025-05-14 20:58:45] Created output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
2
+ [2025-05-14 20:58:45] Chat mode disabled
3
+ [2025-05-14 20:58:45] Model size is 3B or smaller (1 B). Using full fine-tuning.
4
+ [2025-05-14 20:58:45] No QA format data will be used
5
+ [2025-05-14 20:58:45] Limiting dataset size to: 100 samples
6
+ [2025-05-14 20:58:45] =======================================
7
+ [2025-05-14 20:58:45] Starting training for model: google/gemma-3-1b-pt
8
+ [2025-05-14 20:58:45] =======================================
9
+ [2025-05-14 20:58:45] CUDA_VISIBLE_DEVICES: 0,1,2,3
10
+ [2025-05-14 20:58:45] WANDB_PROJECT: wikidyk-ar
11
+ [2025-05-14 20:58:45] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2_trainqas.json
12
+ [2025-05-14 20:58:45] Global Batch Size: 512
13
+ [2025-05-14 20:58:45] Data Size: 100
14
+ [2025-05-14 20:58:45] Executing command: torchrun --nproc_per_node "4" --master-port 29581 src/train.py --model_name_or_path "google/gemma-3-1b-pt" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2_trainqas.json" --output_dir "train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000" --num_upsample "1000" --per_device_train_batch_size "128" --gradient_accumulation_steps "1" --learning_rate "2e-5" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy steps --save_steps 10000 --save_total_limit 3 --resume_from_checkpoint True --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false" --ds_size 100
15
+ [2025-05-14 20:58:45] Training started at Wed May 14 20:58:45 UTC 2025
16
+ W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792]
17
+ W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] *****************************************
18
+ W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
19
+ W0514 20:58:46.997000 610149 site-packages/torch/distributed/run.py:792] *****************************************
20
+ WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
21
+ The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
22
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
23
+ WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
24
+ WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
25
+ WARNING:root:Output directory: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
26
+ The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
27
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
28
+ The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
29
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
30
+ The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
31
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
32
+ WARNING:root:Loading data...
33
+ WARNING:root:Loading data...
34
+ WARNING:root:Loading data...
35
+ WARNING:root:Loading data...
36
+ WARNING:root:Dataset initialized with all QA data:
37
+ WARNING:root: - 100000 QA examples
38
+ WARNING:root: - 100 fact examples with upsampling factor 1000
39
+ WARNING:root: - Total examples: 200000
40
+ /root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
41
+ trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
42
+ WARNING:root:Dataset initialized with all QA data:
43
+ WARNING:root: - 100000 QA examples
44
+ WARNING:root: - 100 fact examples with upsampling factor 1000
45
+ WARNING:root: - Total examples: 200000
46
+ WARNING:root:Dataset initialized with all QA data:
47
+ WARNING:root: - 100000 QA examples
48
+ WARNING:root: - 100 fact examples with upsampling factor 1000
49
+ WARNING:root: - Total examples: 200000
50
+ /root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
51
+ trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
52
+ /root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
53
+ trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
54
+ WARNING:root:Dataset initialized with all QA data:
55
+ WARNING:root: - 100000 QA examples
56
+ WARNING:root: - 100 fact examples with upsampling factor 1000
57
+ WARNING:root: - Total examples: 200000
58
+ /root/yuwei/WikiDYKEvalV2/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
59
+ trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
60
+ Checkpoint missing; starting training from scratch
61
+ Checkpoint missing; starting training from scratch
62
+ Checkpoint missing; starting training from scratch
63
+ Checkpoint missing; starting training from scratch
64
+ wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
65
+ It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
66
+ It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
67
+ It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
68
+ wandb: Currently logged in as: yuweiz to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
69
+ wandb: Tracking run with wandb version 0.19.11
70
+ wandb: Run data is saved locally in /root/yuwei/WikiDYKEvalV2/wandb/run-20250514_205901-64zk7otl
71
+ wandb: Run `wandb offline` to turn off syncing.
72
+ wandb: Syncing run train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
73
+ wandb: ⭐️ View project at https://wandb.ai/yuweiz/wikidyk-ar
74
+ wandb: 🚀 View run at https://wandb.ai/yuweiz/wikidyk-ar/runs/64zk7otl
75
+
76
  0%| | 0/391 [00:00<?, ?it/s]It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `flash_attention_2`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
77
+ [rank3]:[W514 20:59:03.405376210 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
78
+ [rank0]:[W514 20:59:03.408516336 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
79
+ [rank2]:[W514 20:59:03.435576812 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
80
+ [rank1]:[W514 20:59:03.436675735 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
81
+
82
  0%| | 1/391 [00:02<18:26, 2.84s/it][rank1]: Traceback (most recent call last):
83
+ [rank1]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
84
+ [rank1]: trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
85
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
86
+ [rank1]: raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
87
+ [rank1]: ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)
88
+
89
+ [rank1]: During handling of the above exception, another exception occurred:
90
+
91
+ [rank1]: Traceback (most recent call last):
92
+ [rank1]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
93
+ [rank1]: train()
94
+ [rank1]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
95
+ [rank1]: trainer.train()
96
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
97
+ [rank1]: return inner_training_loop(
98
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
99
+ [rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
100
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
101
+ [rank1]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
102
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
103
+ [rank1]: outputs = model(**inputs)
104
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
105
+ [rank1]: return self._call_impl(*args, **kwargs)
106
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
107
+ [rank1]: return forward_call(*args, **kwargs)
108
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
109
+ [rank1]: else self._run_ddp_forward(*inputs, **kwargs)
110
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
111
+ [rank1]: return self.module(*inputs, **kwargs) # type: ignore[index]
112
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
113
+ [rank1]: return self._call_impl(*args, **kwargs)
114
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
115
+ [rank1]: return forward_call(*args, **kwargs)
116
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/accelerate/utils/operations.py", line 814, in forward
117
+ [rank1]: return model_forward(*args, **kwargs)
118
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/accelerate/utils/operations.py", line 802, in __call__
119
+ [rank1]: return convert_to_fp32(self.model_forward(*args, **kwargs))
120
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
121
+ [rank1]: return func(*args, **kwargs)
122
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
123
+ [rank1]: output = func(self, *args, **kwargs)
124
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
125
+ [rank1]: return func(*args, **kwargs)
126
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 966, in forward
127
+ [rank1]: loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
128
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 63, in ForCausalLMLoss
129
+ [rank1]: loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
130
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 35, in fixed_cross_entropy
131
+ [rank1]: loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
132
+ [rank1]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/functional.py", line 3494, in cross_entropy
133
+ [rank1]: return torch._C._nn.cross_entropy_loss(
134
+ [rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 1 has a total capacity of 79.25 GiB of which 5.79 GiB is free. Process 967603 has 33.68 GiB memory in use. Process 1012455 has 39.77 GiB memory in use. Of the allocated memory 33.76 GiB is allocated by PyTorch, and 4.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
135
+ [rank2]: Traceback (most recent call last):
136
+ [rank2]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
137
+ [rank2]: trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
138
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
139
+ [rank2]: raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
140
+ [rank2]: ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)
141
+
142
+ [rank2]: During handling of the above exception, another exception occurred:
143
+
144
+ [rank2]: Traceback (most recent call last):
145
+ [rank2]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
146
+ [rank2]: train()
147
+ [rank2]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
148
+ [rank2]: trainer.train()
149
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
150
+ [rank2]: return inner_training_loop(
151
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
152
+ [rank2]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
153
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
154
+ [rank2]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
155
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
156
+ [rank2]: outputs = model(**inputs)
157
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
158
+ [rank2]: return self._call_impl(*args, **kwargs)
159
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
160
+ [rank2]: return forward_call(*args, **kwargs)
161
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1645, in forward
162
+ [rank2]: return self._post_forward(output)
163
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1620, in _post_forward
164
+ [rank2]: passthrough_tensor_list = _DDPSink.apply(
165
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
166
+ [rank2]: return super().apply(*args, **kwargs) # type: ignore[misc]
167
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 250, in forward
168
+ [rank2]: ret = tuple(
169
+ [rank2]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 251, in <genexpr>
170
+ [rank2]: inp.clone() if isinstance(inp, torch.Tensor) else inp for inp in inputs
171
+ [rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 2 has a total capacity of 79.25 GiB of which 965.44 MiB is free. Process 967604 has 33.77 GiB memory in use. Process 1012456 has 44.53 GiB memory in use. Of the allocated memory 36.89 GiB is allocated by PyTorch, and 5.95 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
172
+ Traceback (most recent call last):
173
+ File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
174
+ trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
175
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
176
+ raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
177
+ ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)
178
+
179
+ During handling of the above exception, another exception occurred:
180
+
181
+ Traceback (most recent call last):
182
+ File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
183
+ train()
184
+ File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
185
+ trainer.train()
186
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
187
+ return inner_training_loop(
188
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
189
+ tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
190
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
191
+ loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
192
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
193
+ outputs = model(**inputs)
194
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
195
+ return self._call_impl(*args, **kwargs)
196
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
197
+ return forward_call(*args, **kwargs)
198
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1645, in forward
199
+ return self._post_forward(output)
200
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1620, in _post_forward
201
+ passthrough_tensor_list = _DDPSink.apply(
202
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
203
+ return super().apply(*args, **kwargs) # type: ignore[misc]
204
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 250, in forward
205
+ ret = tuple(
206
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 251, in <genexpr>
207
+ inp.clone() if isinstance(inp, torch.Tensor) else inp for inp in inputs
208
+ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 0 has a total capacity of 79.25 GiB of which 1.52 GiB is free. Process 967602 has 31.72 GiB memory in use. Process 1012454 has 45.99 GiB memory in use. Of the allocated memory 36.79 GiB is allocated by PyTorch, and 7.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
209
+ [rank0]: Traceback (most recent call last):
210
+ [rank0]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 122, in train
211
+ [rank0]: trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
212
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2213, in train
213
+ [rank0]: raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
214
+ [rank0]: ValueError: No valid checkpoint found in output directory (train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000)
215
+
216
+ [rank0]: During handling of the above exception, another exception occurred:
217
+
218
+ [rank0]: Traceback (most recent call last):
219
+ [rank0]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 134, in <module>
220
+ [rank0]: train()
221
+ [rank0]: File "/root/yuwei/WikiDYKEvalV2/src/train.py", line 126, in train
222
+ [rank0]: trainer.train()
223
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
224
+ [rank0]: return inner_training_loop(
225
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
226
+ [rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
227
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
228
+ [rank0]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
229
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
230
+ [rank0]: outputs = model(**inputs)
231
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
232
+ [rank0]: return self._call_impl(*args, **kwargs)
233
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
234
+ [rank0]: return forward_call(*args, **kwargs)
235
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1645, in forward
236
+ [rank0]: return self._post_forward(output)
237
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1620, in _post_forward
238
+ [rank0]: passthrough_tensor_list = _DDPSink.apply(
239
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
240
+ [rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
241
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 250, in forward
242
+ [rank0]: ret = tuple(
243
+ [rank0]: File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 251, in <genexpr>
244
+ [rank0]: inp.clone() if isinstance(inp, torch.Tensor) else inp for inp in inputs
245
+ [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.12 GiB. GPU 0 has a total capacity of 79.25 GiB of which 1.52 GiB is free. Process 967602 has 31.72 GiB memory in use. Process 1012454 has 45.99 GiB memory in use. Of the allocated memory 36.79 GiB is allocated by PyTorch, and 7.59 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
246
+ wandb:
247
+ wandb: 🚀 View run train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000 at: https://wandb.ai/yuweiz/wikidyk-ar/runs/64zk7otl
248
+ wandb: Find logs at: wandb/run-20250514_205901-64zk7otl/logs
249
+ W0514 20:59:09.640000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 610214 closing signal SIGTERM
250
+ W0514 20:59:09.641000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 610215 closing signal SIGTERM
251
+ W0514 20:59:09.641000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 610217 closing signal SIGTERM
252
+ E0514 20:59:10.219000 610149 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 610216) of binary: /root/miniconda3/envs/wikidyk/bin/python
253
+ Traceback (most recent call last):
254
+ File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
255
+ sys.exit(main())
256
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
257
+ return f(*args, **kwargs)
258
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
259
+ run(args)
260
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
261
+ elastic_launch(
262
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
263
+ return launch_agent(self._config, self._entrypoint, list(args))
264
+ File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
265
+ raise ChildFailedError(
266
+ torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
267
+ ============================================================
268
+ src/train.py FAILED
269
+ ------------------------------------------------------------
270
+ Failures:
271
+ <NO_OTHER_FAILURES>
272
+ ------------------------------------------------------------
273
+ Root Cause (first observed failure):
274
+ [0]:
275
+ time : 2025-05-14_20:59:09
276
+ host : bb9aa167977b
277
+ rank : 2 (local_rank: 2)
278
+ exitcode : 1 (pid: 610216)
279
+ error_file: <N/A>
280
+ traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
281
+ ============================================================
282
+ [2025-05-14 20:59:10] ERROR: Training failed for google/gemma-3-1b-pt with exit code 1
283
+ [2025-05-14 20:59:10] ERROR: Training failed for google/gemma-3-1b-pt with exit code 1
284
+ [2025-05-14 20:59:10] Check error log for details: train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000/20250514_205100.log
285
+ [2025-05-14 20:59:10] Resource usage after training google/gemma-3-1b-pt:
286
+ [2025-05-14 20:59:10] GPU memory usage:
287
+ 32495 MiB, 81920 MiB
288
+ 34501 MiB, 81920 MiB
289
+ 34591 MiB, 81920 MiB
290
+ 32659 MiB, 81920 MiB
291
+ [2025-05-14 20:59:10] Disk space usage for model outputs:
292
+ 32K train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000
293
+ [2025-05-14 20:59:10]
294
+ [2025-05-14 20:59:10] All training runs completed at Wed May 14 20:59:10 UTC 2025
295
+ [2025-05-14 20:59:10] =======================================
296
+ [2025-05-14 20:59:10] Summary of training runs:
297
+ [2025-05-14 20:59:10] Model | Status | Duration | Output Size
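The failure in the log above has two parts: `resume_from_checkpoint=True` is passed while the output directory is still empty, so the Trainer raises "No valid checkpoint found"; the script then falls back to a fresh `trainer.train()`, which immediately hits CUDA OOM because each GPU is already shared with other processes (30+ GiB in use per card) while training runs at per-device batch size 128 with `model_max_length` 32768. The log also repeatedly recommends the `eager` attention implementation for Gemma 3 instead of the deprecated `use_flash_attention_2=True` flag. Below is a minimal, hedged sketch of both adjustments using the standard `transformers` API; it is not the repository's actual `src/train.py`.

```python
import os
import torch
from transformers import AutoModelForCausalLM
from transformers.trainer_utils import get_last_checkpoint

output_dir = "train_results_pred_mask/google_gemma-3-1b-pt_qa_ds100_upsample1000"

# Load Gemma 3 with eager attention, as the warning in the log recommends,
# passing attn_implementation= instead of the deprecated use_flash_attention_2=True.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-pt",
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
)

# Resume only when a checkpoint actually exists: passing resume_from_checkpoint=True
# against an empty output_dir raises "No valid checkpoint found", as in the traceback above.
last_ckpt = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None
# trainer.train(resume_from_checkpoint=last_ckpt)  # Trainer / dataset setup omitted in this sketch
```

For the OOM itself, the error message in the log suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`; reducing the per-device batch size or freeing the other processes occupying the GPUs would address the memory pressure more directly.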
20250514_214337.log ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,57 @@
1
+ ---
2
+ library_name: transformers
3
+ license: gemma
4
+ base_model: google/gemma-3-1b-pt
5
+ tags:
6
+ - generated_from_trainer
7
+ model-index:
8
+ - name: google_gemma-3-1b-pt_qa_ds100_upsample1000
9
+ results: []
10
+ ---
11
+
12
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
13
+ should probably proofread and complete it, then remove this comment. -->
14
+
15
+ # google_gemma-3-1b-pt_qa_ds100_upsample1000
16
+
17
+ This model is a fine-tuned version of [google/gemma-3-1b-pt](https://huggingface.co/google/gemma-3-1b-pt) on an unknown dataset.
18
+
19
+ ## Model description
20
+
21
+ More information needed
22
+
23
+ ## Intended uses & limitations
24
+
25
+ More information needed
26
+
27
+ ## Training and evaluation data
28
+
29
+ More information needed
30
+
31
+ ## Training procedure
32
+
33
+ ### Training hyperparameters
34
+
35
+ The following hyperparameters were used during training:
36
+ - learning_rate: 2e-05
37
+ - train_batch_size: 32
38
+ - eval_batch_size: 8
39
+ - seed: 42
40
+ - distributed_type: multi-GPU
41
+ - num_devices: 4
42
+ - total_train_batch_size: 128
43
+ - total_eval_batch_size: 32
44
+ - optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
45
+ - lr_scheduler_type: linear
46
+ - num_epochs: 1.0
47
+
48
+ ### Training results
49
+
50
+
51
+
52
+ ### Framework versions
53
+
54
+ - Transformers 4.51.3
55
+ - Pytorch 2.6.0+cu124
56
+ - Datasets 3.6.0
57
+ - Tokenizers 0.21.1
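A hedged usage sketch, not part of the auto-generated card: it loads this fine-tuned checkpoint with the standard `transformers` API and samples with the settings from the `generation_config.json` added later in this commit (`do_sample=True`, `top_k=64`, `top_p=0.95`). The repository id below is an assumption; replace it with this repo's actual path.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id for this checkpoint; adjust to the actual repo path.
repo_id = "YWZBrandon/google_gemma-3-1b-pt_qa_ds100_upsample1000"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

prompt = "Did you know that"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling settings mirror generation_config.json in this commit.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=64, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```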
added_tokens.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "<image_soft_token>": 262144
3
+ }
config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "architectures": [
3
+ "Gemma3ForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "attn_logit_softcapping": null,
8
+ "bos_token_id": 2,
9
+ "cache_implementation": "hybrid",
10
+ "eos_token_id": 1,
11
+ "final_logit_softcapping": null,
12
+ "head_dim": 256,
13
+ "hidden_activation": "gelu_pytorch_tanh",
14
+ "hidden_size": 1152,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 6912,
17
+ "max_position_embeddings": 32768,
18
+ "model_type": "gemma3_text",
19
+ "num_attention_heads": 4,
20
+ "num_hidden_layers": 26,
21
+ "num_key_value_heads": 1,
22
+ "pad_token_id": 0,
23
+ "query_pre_attn_scalar": 256,
24
+ "rms_norm_eps": 1e-06,
25
+ "rope_local_base_freq": 10000,
26
+ "rope_scaling": null,
27
+ "rope_theta": 1000000,
28
+ "sliding_window": 512,
29
+ "sliding_window_pattern": 6,
30
+ "torch_dtype": "bfloat16",
31
+ "transformers_version": "4.51.3",
32
+ "use_cache": true,
33
+ "vocab_size": 262144
34
+ }
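If it helps to sanity-check the architecture recorded in the config above (text-only `Gemma3ForCausalLM`, 26 layers, hidden size 1152, vocab 262144), here is a small hedged sketch using `AutoConfig`; the local path is an assumption.

```python
from transformers import AutoConfig

# Assumes the current directory is a local clone of this repository (contains config.json).
cfg = AutoConfig.from_pretrained(".")

print(cfg.model_type)     # gemma3_text
print(cfg.architectures)  # ['Gemma3ForCausalLM']
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.vocab_size)  # 26 1152 262144
```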
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "bos_token_id": 2,
3
+ "cache_implementation": "hybrid",
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 1,
7
+ 106
8
+ ],
9
+ "pad_token_id": 0,
10
+ "top_k": 64,
11
+ "top_p": 0.95,
12
+ "transformers_version": "4.51.3"
13
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:511a0ed61fa6c255b5cd08a9686978c84fa7008e2afe85cd984bfa4c0bd04209
3
+ size 1999811208
special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "boi_token": "<start_of_image>",
3
+ "bos_token": {
4
+ "content": "<bos>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ "eoi_token": "<end_of_image>",
11
+ "eos_token": {
12
+ "content": "<eos>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false
17
+ },
18
+ "image_token": "<image_soft_token>",
19
+ "pad_token": {
20
+ "content": "<pad>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false
25
+ },
26
+ "unk_token": {
27
+ "content": "<unk>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ }
33
+ }
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
3
+ size 4689074
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f7d90e12a70a8396ebe6274a9613d3a616d472980298bed0bcba56d9119149a5
3
+ size 5432