expred issues
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues
2021-04-13T12:26:53+02:00

Runs on same node seem to block each other
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/8
2021-04-13T12:26:53+02:00, Maximilian Reimer

Some runs (if started shortly after one another on the same node) seem to block each other.
This seems to happen even before the output dir is created! The only things done before that are adding and parsing the args, as well as `wandb.init(entity="explainable-nlp", project="expred")`. I suspect it is a problem with wandb and will file an issue.
`condor_q --nobatch`:
```
-- Schedd: deken1.local : <192.168.1.80:9618?... @ 04/09/21 10:25:04
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
22315.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.3_23
22316.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.3_44
22317.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.4_24
22318.0 mreimer 4/8 13:32 0+20:46:59 R 0 318.0 run_with_env.sh expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies_0.4_45
```
When connecting via `condor_ssh_to_job` and running `ps -fu mreimer --sort cmd`, you get (all on node01):
```
UID PID PPID C STIME TTY TIME CMD
mreimer 982070 982054 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982054/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982071 982055 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982055/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982078 982056 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982056/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982082 982059 0 Apr08 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_982059/condor_exec.exe expred /home/mreimer/projects/expred scripts/run_movies_from_config.sh movies
mreimer 982152 982128 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 982164 982127 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 982165 982124 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 982171 982126 0 Apr08 ? 00:00:00 git cat-file --batch-check
mreimer 1051119 1050833 0 10:30 pts/0 00:00:00 ps -fu mreimer --sort cmd
mreimer 982075 982072 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.3_23 --output_dir outputs/movies_0.3_23/21_08_04_13_38
mreimer 982077 982074 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.3_44 --output_dir outputs/movies_0.3_44/21_08_04_13_38
mreimer 982081 982079 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.4_24 --output_dir outputs/movies_0.4_24/21_08_04_13_38
mreimer 982085 982083 0 Apr08 ? 00:00:02 python expred/train.py --seed 100 --data_dir /home/mreimer/datasets/eraser/movies_0.4_45 --output_dir outputs/movies_0.4_45/21_08_04_13_38
mreimer 982107 982075 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982108 982081 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982109 982077 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982110 982085 0 Apr08 ? 00:00:00 /home/mreimer/envs/expred/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
mreimer 982124 982085 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982126 982075 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982127 982081 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982128 982077 0 Apr08 ? 00:00:01 /home/mreimer/envs/expred/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=4, pipe_handle=11) --multiproc
mreimer 982072 982070 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.3_23
mreimer 982074 982071 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.3_44
mreimer 982079 982078 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.4_24
mreimer 982083 982082 0 Apr08 ? 00:00:00 /bin/bash scripts/run_movies_from_config.sh movies_0.4_45
```

Add sanity checks
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/5
2021-04-08T15:58:36+02:00, Maximilian Reimer

I think two things would be great:
1. Simple sanity checks (unit tests) that run in the pipeline after every push, e.g. on a dummy dataset with just one batch of examples. They should also run in the CI, just to check that all stages are working correctly.
2. Some unit tests that we run every week or so to check that the model still performs well on a known yet small dataset (e.g. movies).

Switch to PyTorch Lightning training loop
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/7
2021-04-08T16:00:33+02:00, Maximilian Reimer

With [Issue 4](https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/4) done, we could easily switch to PyTorch Lightning to make the code structure cleaner and more robust.

Refactor loading of data to use a data loader
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/4
2021-04-08T16:00:33+02:00, Maximilian Reimer

I would suggest wrapping the data loading into a PyTorch data loader and just using a sampler for shuffling and batching.
One needs to
1. Implement a dataset that takes care of loading and tokenization and returns instances. I think a map-style dataset that returns a dict/object (like `SentenceEvidence`) would fit.
2. Implement a `collate_fn` that takes a list of these objects and returns a batch.
3. Then one can just use:
```python
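from torch.utils.data import DataLoader

def collate_and_padd_batch(instances):
    # Pad the instances to a common length and combine them into one batch.
    ...

dataset = EraserDataset(..., split='train')
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0, collate_fn=collate_and_padd_batch)
for batch in loader:
    # do training
    ...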
```
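Steps 1 and 2 could look roughly like the sketch below. It is not the project's implementation: `Instance`, `ToyEraserDataset`, and `collate_and_pad` are hypothetical names, the fields `token_ids` and `label` are assumed, and plain lists are used instead of tensors so the example runs without PyTorch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    # Hypothetical stand-in for an object like `SentenceEvidence`.
    token_ids: List[int]
    label: int


class ToyEraserDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, instances: List[Instance]):
        self.instances = instances

    def __len__(self) -> int:
        return len(self.instances)

    def __getitem__(self, index: int) -> Instance:
        return self.instances[index]


def collate_and_pad(batch: List[Instance], pad_id: int = 0) -> Dict[str, list]:
    """Pad every instance to the longest sequence in the batch."""
    max_len = max(len(inst.token_ids) for inst in batch)
    input_ids, attention_mask = [], []
    for inst in batch:
        n_pad = max_len - len(inst.token_ids)
        input_ids.append(inst.token_ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(inst.token_ids) + [0] * n_pad)
    labels = [inst.label for inst in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

In real code the dataset would normally subclass `torch.utils.data.Dataset`, and the collate function would convert the padded lists to tensors before returning them.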
For more information, see the [PyTorch data tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

When resuming training the optimizer can not be restored correctly.
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/9
2021-04-09T16:12:06+02:00, Maximilian Reimer

When resuming training, the optimizer cannot be restored correctly. This leads to an error because the initial learning rate is not set.

Crashed runs on Fever
https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/10
2021-04-13T12:26:53+02:00, Maximilian Reimer

When running the experiments with different fractions of rationales on fever, quite a lot of runs simply crashed.
There were three kinds of crashes:
1. Not listed in wandb and stuck due to [Issue 8](https://git.l3s.uni-hannover.de/mreimer/expred/-/issues/8) (solved)
2. Listed in wandb as crashed, maybe just a wandb issue?
3. Not listed in wandb (seems to have never been started) -> run again
**To 2.**
| Name | State | Hostname | identifier.epoch | classifier.epoch | output_dir | scored | cls epochs in log | scores_uploaded |
|---------------------- |--------- |---------- |------------------ |------------------ |-------------------------------------------- |-------- |------------------- |----------------- |
| warm-puddle-152 | crashed | node03 | 9 | 1 | outputs/fever_0.7_27/100/21_09_04_18_14_07 | ✅ | 1 | ✅ |
| breezy-elevator-156 | crashed | node03 | 9 | 1 | outputs/fever_0.5_46/100/21_09_04_18_14_07 | ✅ | 1 | ✅ |
| worthy-water-151 | crashed | node04 | 9 | 1 | outputs/fever_0.7_48/100/21_09_04_18_14_07 | ✅ | 1 | ✅ |
| grateful-silence-212 | crashed | node02 | 5 | | outputs/fever_0.8_49/200/21_09_04_18_22_58 | - | 0 | - |
| stoic-deluge-160 | crashed | node05 | 9 | 0 | outputs/fever_0.9_50/100/21_09_04_18_58_58 | ✅ | 1 | ✅ |
| misty-shape-145 | crashed | node06 | 9 | 1 | outputs/fever/100/21_09_04_18_13_48 | ✅ | 1 | ✅ |
3. Not investigated yet
Missing runs (all on fever)
| dataset_meta.rationals_fraction | dataset_meta.generation_seed | Cause |
|----------------------------------|------------------------------|-------|
| 0.2 | 43 | not created at all? (no output dir) |
| 0.4 | 45 | not created at all? (no output dir) |
| 0.6 | 47 | not created at all? (no output dir) |
| 0.8 | 28 | not created at all? (no output dir) |
and fever on seed (
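
The "no output dir" check for missing runs can be sketched as a small script. This is only an illustration under assumptions: `find_missing_runs` is a hypothetical helper, and the directory layout `<output_root>/<dataset>_<fraction>_<seed>/...` is inferred from the `output_dir` column above rather than confirmed by the code.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_missing_runs(
    output_root,
    expected: Iterable[Tuple[str, int]],
    dataset: str = "fever",
) -> List[Tuple[str, int]]:
    """Return the (fraction, seed) pairs for which no run directory exists."""
    root = Path(output_root)
    missing = []
    for fraction, seed in expected:
        # Assumed naming scheme, e.g. outputs/fever_0.7_27
        run_dir = root / f"{dataset}_{fraction}_{seed}"
        # A run counts as started only if its directory exists and is not empty.
        if not run_dir.is_dir() or not any(run_dir.iterdir()):
            missing.append((fraction, seed))
    return missing
```

Fractions are passed as strings (e.g. `"0.7"`) so that the directory name does not depend on float formatting. Calling `find_missing_runs("outputs", [("0.2", 43), ("0.4", 45)])` would list the pairs that still need to be rerun.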