Runs on the same node seem to block each other
Some runs (if started shortly after one another on the same node) seem to block each other.
This seems to happen even before the output-dir is created! The only things done before that are adding args and parsing them, as well as wandb.init(entity="explainable-nlp", project="expred")
I suspect it is a problem with wandb and will file an issue.
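One thing that could be tried while the cause is unclear: the process listing in this issue shows a `git cat-file --batch-check` child under each run, so the hang may be related to wandb inspecting the git repository during `wandb.init`. The sketch below sets wandb's `WANDB_DISABLE_GIT` environment variable before init. That this avoids the blocking is an untested assumption, and `disable_wandb_git_probe` is a hypothetical helper name, not part of expred or wandb.

```python
import os


def disable_wandb_git_probe(env=os.environ):
    """Ask wandb not to probe the git repository during wandb.init.

    Assumption: the stuck `git cat-file --batch-check` processes are started
    by wandb's git inspection. This has not been verified for this issue.
    """
    # setdefault keeps any value that was already set by the job environment.
    env.setdefault("WANDB_DISABLE_GIT", "true")
    return env


if __name__ == "__main__":
    disable_wandb_git_probe()
    # The existing call would follow unchanged:
    # wandb.init(entity="explainable-nlp", project="expred")
```

If the runs start normally with this set, that would point at the git probing. If they still hang, the cause is somewhere else in the init.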
Output of condor_q --nobatch:
-- Schedd: deken1.local : <192.168.1.80:9618?... @ 04/09/21 10:25:04
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
22315.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.3_23
22316.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.3_44
22317.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.4_24
22318.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.4_45
When connecting via condor_ssh_to_job and running ps -fu mreimer --sort cmd, you get the following (all on node01):
UID PID PPID C STIME TTY TIME CMD
mreimer 982070 982054 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982054/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982071 982055 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982055/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982078 982056 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982056/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982082 982059 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982059/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982152 982128 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 982164 982127 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 982165 982124 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 982171 982126 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 1051119 1050833 0 10:30 pts/0 00:00:00 ps -fu mreimer --sort cmd
mreimer 982075 982072 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.3_23 --output_dir outputs/movies_0.3_23/21_08_04_13_38
mreimer 982077 982074 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.3_44 --output_dir outputs/movies_0.3_44/21_08_04_13_38
mreimer 982081 982079 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.4_24 --output_dir outputs/movies_0.4_24/21_08_04_13_38
mreimer 982085 982083 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.4_45 --output_dir outputs/movies_0.4_45/21_08_04_13_38
mreimer 982107 982075 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982108 982081 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982109 982077 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982110 982085 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982124 982085 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982126 982075 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982127 982081 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982128 982077 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982072 982070 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.3_23
mreimer 982074 982071 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.3_44
mreimer 982079 982078 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.4_24
mreimer 982083 982082 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.4_45
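Because the listing is sorted by command, the parent/child relationships are hard to see. The following is a minimal sketch that parses `ps -f` output and maps each `git cat-file` process back to the `train.py` run it descends from. The function names are hypothetical, and the sample uses three rows copied from the listing above.

```python
def parse_ps(text):
    """Parse `ps -f` output into {pid: (ppid, cmd)}."""
    procs = {}
    for line in text.strip().splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        procs[int(fields[1])] = (int(fields[2]), fields[7])
    return procs


def git_to_train(procs, train_prefix="python expred/train.py"):
    """Map each `git cat-file` PID to the PID of its train.py ancestor."""
    result = {}
    for pid, (_, cmd) in procs.items():
        if not cmd.startswith("git cat-file"):
            continue
        seen = set()
        current = pid
        # Walk up the PPID chain until a train.py process is found.
        while current in procs and current not in seen:
            seen.add(current)
            if procs[current][1].startswith(train_prefix):
                result[pid] = current
                break
            current = procs[current][0]
    return result


SAMPLE = """
UID PID PPID C STIME TTY TIME CMD
mreimer 982075 982072 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.3_23 --output_dir outputs/movies_0.3_23/21_08_04_13_38
mreimer 982126 982075 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982171 982126 0 Apr08 ? 00:00:00 git cat-file --batch-check
"""

if __name__ == "__main__":
    for git_pid, train_pid in git_to_train(parse_ps(SAMPLE)).items():
        print(f"git {git_pid} belongs to train.py {train_pid}")
```

Applied to the full listing, this should show one `git cat-file` process per run, each started from that run's `multiprocessing.spawn` child.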
Edited by Maximilian Reimer