nohup: ignoring input
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793]
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] *****************************************
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] *****************************************
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
/root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory
E1030 19:02:35.887000 8870 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 2) local_rank: 0 (pid: 8935) of binary: /root/miniconda3/envs/meditron/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/meditron/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tune_combined.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 8936)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 8937)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 3 (local_rank: 3)
exitcode : 2 (pid: 8938)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 4 (local_rank: 4)
exitcode : 2 (pid: 8939)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 5 (local_rank: 5)
exitcode : 2 (pid: 8940)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 6 (local_rank: 6)
exitcode : 2 (pid: 8941)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-30_19:02:35
host : 3f2e085cdcba
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 8935)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================