As we design the Calculation Engine (CE), it is becoming clear that its primary role is to be a “workflow engine”. We have a set of MUSES modules that calculate quantities from inputs, and their outputs may or may not feed into subsequent modules as inputs. Typically this will involve only two modules: an EoS module and a User module that calculates physical observables from the EoS generated by the first module, but in general any DAG (directed acyclic graph) of modules may be desired. Workflow engines are a “solved problem”, and there is even a popular solution from the Argo project:
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
Define workflows where each step in the workflow is a container.
Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a directed acyclic graph (DAG).
Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes.
Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.
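Setting the choice of engine aside for a moment, the two-module case described above reduces to a very small DAG. Here is a minimal plain-Python sketch of that structure, just to make the dependency handling concrete; the module functions and input names are hypothetical placeholders, not actual MUSES interfaces:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stand-ins for MUSES modules; in the real CE these would be
# containerized modules, not Python functions.
def run_eos_module(config):
    # Would produce an equation-of-state table from the input configuration.
    return {"eos_table": f"eos({config})"}

def run_user_module(eos_output):
    # Would compute physical observables from the EoS module's output.
    return {"observables": f"observables({eos_output['eos_table']})"}

# The DAG: each node maps to the set of nodes it depends on.
dag = {
    "eos": set(),
    "user": {"eos"},
}

# How to run each node, given the results accumulated so far.
runners = {
    "eos": lambda results: run_eos_module("toy-input"),
    "user": lambda results: run_user_module(results["eos"]),
}

results = {}
for node in TopologicalSorter(dag).static_order():
    results[node] = runners[node](results)

print(results["user"])
```

Whatever engine we pick ultimately has to do this plus containerization, scheduling, retries, and artifact passing, which is exactly the part we would rather not build ourselves.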
This looks amazing and may save us a lot of work. One question I immediately have: can Argo Workflows orchestrate remote job execution for portions of the workflow, for example if an EoS module runs on NCSA Delta to utilize its GPUs while the subsequent User module runs on our Kubernetes cluster?
We may eventually determine that implementing our own custom workflow engine is the right approach, but we should investigate potential existing solutions first.
We should also research Dask for workflow management:
Start on a laptop, but scale to a cluster, no matter what infrastructure you use. Dask deploys on Kubernetes, cloud, or HPC, and Dask libraries make it easy to use as much or as little compute as you need.
They are describing the ability of the software to run on computing resources across a range of sizes without changing the source code or deployment configuration. That is also what we want for the CE: at the smallest scale it should run on an individual laptop; at a larger scale it should run in an HPC environment; and at a larger scale still, it should run on a system like Kubernetes that can dynamically distribute it across multiple “compute nodes” in parallel.
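As a concrete (and hedged) illustration of that claim, here is roughly what the two-module chain might look like with dask.delayed; the module functions are hypothetical placeholders, and the point is only that the same task graph runs locally or on a cluster depending on what the Client is connected to:

```python
from dask import delayed
from dask.distributed import Client

# Hypothetical stand-ins for the MUSES modules.
def run_eos_module(config):
    return {"eos_table": f"eos({config})"}

def run_user_module(eos_output):
    return {"observables": f"observables({eos_output['eos_table']})"}

if __name__ == "__main__":
    # A local in-process scheduler, suitable for a laptop. Pointing the
    # Client at a dask-jobqueue (HPC) or dask-kubernetes cluster instead
    # would move execution elsewhere without changing the graph below.
    client = Client(processes=False)

    eos = delayed(run_eos_module)("toy-input")
    observables = delayed(run_user_module)(eos)

    print(observables.compute())
```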
At the Software Directorate all-hands meeting today, someone presented work that involved WDL (Workflow Description Language), which I had never heard of before but which seems relevant to this topic.
The Workflow Description Language (WDL) is a way to specify data processing workflows with a human-readable and -writeable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution. The language makes common patterns simple to express, while also admitting uncommon or complicated behavior; and strives to achieve portability not only across execution platforms, but also different types of users. Whether one is an analyst, a programmer, an operator of a production system, or any other sort of user, WDL should be accessible and understandable.
I suspect I may have brought these up before, but I want to have them in here as well. There are a good number of workflow engines to choose from; a whole series of engines was already presented for Blue Waters (Blue Waters User Portal | Scientific Workflows). The interesting ones for us would be:
The likelihood of someone else writing a “Provider” for Delta for us is a compelling reason to choose Parsl:
There is a simple LocalProvider for shared memory multiprocessors and a provider for Kubernetes in the cloud. In addition, there are special providers for a host of supercomputers including
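To make the Parsl model concrete, here is a rough sketch based on my reading of the Parsl docs; the app bodies are hypothetical placeholders, and I have not included a Delta-specific provider configuration since that is exactly the open question:

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# A local configuration. The appeal of Parsl is that targeting Slurm,
# Kubernetes, or (hopefully) Delta should only change this Config object,
# not the app definitions below.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="local",
            provider=LocalProvider(init_blocks=1, max_blocks=1),
        )
    ]
)
parsl.load(config)

@python_app
def run_eos_module(config_name):
    # Hypothetical stand-in for the EoS module.
    return {"eos_table": f"eos({config_name})"}

@python_app
def run_user_module(eos_output):
    # Hypothetical stand-in for the User module; passing the EoS app's
    # future as an argument lets Parsl handle the dependency.
    return {"observables": f"observables({eos_output['eos_table']})"}

eos_future = run_eos_module("toy-input")
obs_future = run_user_module(eos_future)
print(obs_future.result())
```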