Cluster resource utilization

As I mentioned a few weeks ago in our CI meeting notes topic, after the recent migration to a new Kubernetes cluster running K3s v1.30, we are no longer able to constrain the resources (CPU & RAM) consumed by the CE workloads. The technical details of why this is the case are fairly esoteric, so I’ll only explain them if anyone is interested; the relevant point is that the lack of resource constraints puts our computing cluster’s stability at risk under heavy use.

Previously, we could place the limits on the task workers themselves (effectively at the level of the compute node), which meant we could set high concurrency for high throughput without sacrificing performance: some tasks consume more CPU and some less, so CPU use was distributed dynamically according to whichever workflows were running at the moment.

:+1: The good news is that I found a relatively painless way to recover resource limit enforcement using the options supported by the Docker service that actually executes our MUSES module containers during workflow runs.

:-1: The bad news is that moving the resource limit enforcement down to the individual workflow task level means that we are forced to balance task throughput against task performance for each module.
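
To make the per-task enforcement concrete, here is a minimal sketch of what setting limits through the Docker engine could look like, assuming the Python Docker SDK (`docker-py`); the image name and the limit values are placeholders, not our actual configuration.

```python
import docker  # Python Docker SDK

# Connect to the local Docker daemon that executes module containers.
client = docker.from_env()

# Hypothetical example: launch a MUSES module container with per-container
# CPU and memory limits. The image name and values are placeholders.
container = client.containers.run(
    "muses/module-example",
    detach=True,
    nano_cpus=4 * 10**9,  # equivalent to `docker run --cpus=4`
    mem_limit="8g",       # equivalent to `docker run --memory=8g`
)
```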

The diagram below illustrates the issue:

In this example, we run a single task worker on each cluster node and allow that task worker to run up to 2 concurrent tasks (i.e. MUSES modules). Since each cluster node has 16 CPUs, we can set a limit of 8 CPUs per task. And since our cluster has 10 worker nodes, we could run up to 20 workflow tasks in parallel.

The diagram below depicts a different balance:

Here, we double the task throughput to 40 concurrent tasks, but at the expense of limiting each task to a maximum of 4 CPUs.
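
For reference, the arithmetic behind both balances is just node CPUs divided by per-worker concurrency; a trivial sketch (the numbers mirror the two examples above):

```python
def plan(nodes: int, cpus_per_node: int, concurrency: int) -> tuple[int, int]:
    """Return (CPU limit per task, max tasks in parallel) for a uniform worker setup."""
    cpu_limit_per_task = cpus_per_node // concurrency
    max_parallel_tasks = nodes * concurrency
    return cpu_limit_per_task, max_parallel_tasks

print(plan(nodes=10, cpus_per_node=16, concurrency=2))  # (8, 20)
print(plan(nodes=10, cpus_per_node=16, concurrency=4))  # (4, 40)
```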

We know that some MUSES modules are designed to use no more than 1 CPU, while others are written to run multicore. It is possible to improve cluster utilization by assigning certain module tasks to specially configured task workers. For example, perhaps 6 of the 10 task workers (queue “A”) are configured with concurrency 16 and a CPU limit of 1, so that they provide high throughput for low-CPU modules; 2 of the 10 workers (queue “B”) are configured with concurrency 4 and a CPU limit of 4; and the remaining 2 (queue “C”) are configured with concurrency 2 and a CPU limit of 8. This kind of multi-queue configuration would require adding more information to the CE config for each registered MUSES module: namely, the maximum CPU and memory the module expects to use.
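
As a sketch of what that extra metadata and the resulting routing could look like (the module names, thresholds, and helper below are hypothetical, not an actual design):

```python
# Hypothetical per-module resource declarations that would live in the CE config.
MODULE_RESOURCES = {
    "module-low-cpu":   {"max_cpu": 1, "max_mem_gib": 2},
    "module-multicore": {"max_cpu": 8, "max_mem_gib": 16},
}

def pick_queue(module_name: str) -> str:
    """Route a module task to queue A, B, or C based on its declared CPU needs."""
    max_cpu = MODULE_RESOURCES.get(module_name, {}).get("max_cpu", 1)
    if max_cpu <= 1:
        return "A"  # 6 workers, concurrency 16, 1 CPU per task
    if max_cpu <= 4:
        return "B"  # 2 workers, concurrency 4, 4 CPUs per task
    return "C"      # 2 workers, concurrency 2, 8 CPUs per task
```

A task submission would then pass the queue explicitly, e.g. `run_module.apply_async(kwargs=payload, queue=pick_queue(module_name))`, where `run_module` stands in for whatever Celery task actually launches the container.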

For now I plan to implement the new limit enforcer and deploy it as soon as I can, to hopefully ensure cluster stability by the time we have our MUSES meeting and more people are using the service. @jakinh We should discuss this issue at the meeting and decide how to proceed.

Based on discussion with @mrpelicer, I implemented the following constraints: 4 CPUs per module run (i.e. workflow process, a.k.a. Celery task); 3 concurrent tasks per node (so a maximum of 21 concurrent tasks given our 7 task workers); and no memory limits. If we find that system load exceeds our 64 GiB of RAM per node and crashes nodes, we will have to enforce memory limits, but I’m willing to see what happens in practice first.
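
Expressed as settings, that amounts to roughly the following (a sketch; the names are illustrative, not our actual configuration keys):

```python
# Illustrative per-node settings corresponding to the constraints above.
WORKER_CONCURRENCY = 3        # 3 concurrent tasks per node; 7 workers -> up to 21 tasks
TASK_NANO_CPUS = 4 * 10**9    # 4 CPUs per module container (Docker `--cpus=4`)
TASK_MEM_LIMIT = None         # no memory limit for now; revisit if 64 GiB/node is exceeded
```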

Now it’s your turn, @devs: I want to see so many workflows processing that we find out where this thing breaks! Before the MUSES meeting.

Good news: I eventually devised a solution that recovers our original ability to limit resources per task worker! That means we no longer have to set arbitrary per-workflow-task limits; instead, up to 12 concurrent MUSES modules can run on each node, and their total CPU/memory usage is capped. Thus, if two tasks need 4 CPUs while the others need less than 1, the CPU-hungry tasks will likely be able to eat their fill.

In that example I said two CPU-hungry tasks, but since tasks are currently distributed randomly to workers, there could easily be 12 CPU-hungry tasks trying to feed from the same 12-CPU trough. Hence, we may still decide to create separate “high performance” (HPC) and “high throughput” (HTC) queues, where some of the Celery workers have lower concurrency but guarantee more CPU per task (HPC) and others have higher concurrency for low-CPU tasks (HTC). This is something I hope we can discuss at next week’s meeting, along with some more real-world usage feedback. @jakinh
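
If we do split into HPC and HTC queues, launching differently tuned workers is straightforward; here is a hedged sketch using Celery’s programmatic worker entry point (the app name, broker URL, queue names, and concurrency values are placeholders):

```python
from celery import Celery

# Placeholder app and broker; our real CE app would be imported instead.
app = Celery("ce", broker="redis://localhost:6379/0")

# HPC flavor: few slots per node, more CPU headroom per task.
app.worker_main(argv=["worker", "-Q", "hpc", "--concurrency=2", "--loglevel=INFO"])

# HTC flavor (run on other nodes instead; worker_main blocks, so only one per process):
# app.worker_main(argv=["worker", "-Q", "htc", "--concurrency=16", "--loglevel=INFO"])
```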
