nohup: ignoring input
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793]
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] *****************************************
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] *****************************************
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
E1030 19:02:35.887000 8870 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 2) local_rank: 0 (pid: 8935) of binary: /root/miniconda3/envs/meditron/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/meditron/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tune_combined.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 8936)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 8937)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 8938)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 8939)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 8940)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 8941)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-30_19:02:35
  host      : 3f2e085cdcba
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 8935)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================