Overflow of job scratch space

Continuing the discussion from Alpha Release: Bugs, Feedback, Ideas, and Improvements!:

Currently, just 27 jobs are consuming all 10 GB of the scratch space allocated for this purpose, an average of roughly 0.37 GB per job. This is more output per job than we anticipated.

We need to do a few things to fix this problem, but since I am abroad at a conference, a fix may not be deployed until next week.

  • Implement proper deletion of job scratch files upon job completion. This is trickier in the case of failed or aborted jobs.

  • Expand the persistent volume storing the scratch files. We could double (or triple) the size of the volume; going beyond that will take more effort, as I would need to expand the OpenStack volumes attached to the cluster nodes that host the persistent volumes used by applications like the CE.
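One way the post-job deletion in the first item could also cover failed jobs is to put the cleanup in a `finally` block. This is only a minimal sketch, assuming one scratch directory per job UUID under a scratch root; `run_job` and its `work` callable are hypothetical names for illustration and not the CE's actual code.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def run_job(scratch_root: Path, work) -> None:
    """Run `work(job_dir)` and always remove the job's scratch directory.

    Hypothetical helper: `work` stands in for whatever executes the workflow.
    """
    job_dir = scratch_root / str(uuid.uuid4())
    job_dir.mkdir(parents=True)
    try:
        work(job_dir)
    finally:
        # Runs on success and when `work` raises. It does NOT run if the
        # worker process is killed outright (e.g. SIGKILL).
        shutil.rmtree(job_dir, ignore_errors=True)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())

    def failing_job(job_dir: Path) -> None:
        (job_dir / "output.dat").write_text("partial results")
        raise RuntimeError("simulated job failure")

    try:
        run_job(root, failing_job)
    except RuntimeError:
        pass
    print(list(root.iterdir()))  # scratch root is empty again
```

Because a hard kill skips the `finally` block, aborted jobs would probably still need a separate sweep that removes scratch directories no longer tied to a running job.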

I have already started implementing a solution for the post-job deletion, which should solve 99% of the problem. Check here and the linked issue to monitor progress. (And if you feel like helping, dive in! This is a good bite-sized CE development task that can be debugged locally.)

Given the severity of the issue, and the fact that things cannot get much worse than the current unusable state :laughing:, I pushed my hotfix with minimal testing. CE v0.5.11 is deployed and appears to be deleting scratch folders upon completion (at least for successful workflow completions). The volume has also been expanded from the original 10 GiB to 20 GiB.

Please post here if you encounter more errors related to this.

The results of the test I just ran indicate that, at least for successful jobs, scratch files are being deleted. I used the job cannon to launch five workflows of different types on alpha.musesframework.io:

$ date ; python test/job_cannon.py 1 5; date
Fri Sep 20 10:41:02 PM CEST 2024
[22:41:03] Process started: test_cmf-0
[22:41:03] Process started: test_eos_taylor_4d-0
[22:41:03] Process started: test_eos_ising_texs_2d-0
[22:41:03] Process started: test_lepton-0
[22:41:03] Process started: test_holographic_eos-0
...
[22:41:13] Process finished: test_lepton-0
[22:41:23] Process finished: test_cmf-0
[22:41:23] Process finished: test_eos_taylor_4d-0
[22:41:23] Process finished: test_holographic_eos-0
[22:41:33] Process finished: test_eos_ising_texs_2d-0
Fri Sep 20 10:41:33 PM CEST 2024

During the workflow execution (the ellipsis marks the interval between 22:41:03 and 22:41:13), I listed the scratch directory, where the job UUIDs are visible. Note that timestamps inside the container are two hours behind the CEST times above:

$ kubectl exec -n ce-alpha celery-worker-0 -c worker -- bash -c 'ls -la /scratch'
total 28
drwxrwxrwx 7 ce   ce   4096 Sep 20 20:41 .
drwxr-xr-x 1 root root 4096 Sep 19 08:23 ..
drwxr-xr-x 3 ce   ce   4096 Sep 20 20:41 2d00b6d9-de67-40f3-847f-62024cf82827
drwxr-xr-x 3 ce   ce   4096 Sep 20 20:41 4387f565-54c9-4a0a-9d45-115aacc8a117
drwxr-xr-x 3 ce   ce   4096 Sep 20 20:41 8631cbc5-0192-4e5d-ad0b-94de75aff898
drwxr-xr-x 3 ce   ce   4096 Sep 20 20:41 eb79c4a3-990a-4f88-9de3-fafbd8d03227
drwxr-xr-x 3 ce   ce   4096 Sep 20 20:41 ec9311a9-a89a-4081-9e87-2ab902654a10

After the completion of the jobs, the list is empty:

$ kubectl exec -n ce-alpha celery-worker-0 -c worker -- bash -c 'ls -la /scratch'
total 8
drwxrwxrwx 2 ce   ce   4096 Sep 20 20:41 .
drwxr-xr-x 1 root root 4096 Sep 19 08:23 ..
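
The manual before/after comparison above could be scripted. Here is a minimal sketch, assuming scratch directories are named by job UUID as in the listings; `leftover_job_dirs` is a hypothetical helper and not part of the CE test suite.

```python
import re

# Matches lowercase UUID directory names like those in the listings above.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def leftover_job_dirs(ls_output: str) -> list[str]:
    """Return the UUID-named entries found in `ls -la` output."""
    names = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Entry lines have nine fields (permissions ... name); the
        # "total N" header line is skipped because it has only two.
        if len(fields) >= 9 and UUID_RE.match(fields[-1]):
            names.append(fields[-1])
    return names


if __name__ == "__main__":
    after = """total 8
drwxrwxrwx 2 ce   ce   4096 Sep 20 20:41 .
drwxr-xr-x 1 root root 4096 Sep 19 08:23 ..
"""
    print(leftover_job_dirs(after))  # no UUID directories remain
```

Feeding it the output of the `kubectl exec` command shown above after a job cannon run would give a quick pass/fail check that no job directories were left behind.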