Argo Workflows

andrew.manning · October 10, 2022, 5:46pm

As we design the Calculation Engine, it is becoming clear that the primary role of the CE is to be a “workflow engine”. We have a set of MUSES modules that calculate some quantities based on inputs, and the outputs may or may not be used as inputs to subsequent modules. Typically this will involve only two modules: an EoS module and a User module that calculates some physical observables based on the EoS generated by the first module. But in general, any DAG (directed acyclic graph) may be desired. Workflow engines are a “solved problem” and there is even a popular solution from the Argo project:

Argo Workflows - The workflow engine for Kubernetes

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).

Define workflows where each step in the workflow is a container.

Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a directed acyclic graph (DAG).

Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes.

Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.

This looks amazing and may save us lots of work. One of the questions I immediately have is: Can Argo Workflows integrate remote job execution for portions of the workflow, for example if an EoS module runs on NCSA Delta to utilize GPUs but the subsequent User module runs on our Kubernetes cluster?

We may eventually determine that implementing our own custom workflow engine is the right approach, but we should investigate potential existing solutions first.

@cyberinfrastructure

andrew.manning · October 10, 2022, 9:33pm

We should also research Dask for workflow management:

Start on a laptop, but scale to a cluster, no matter what infrastructure you use. Dask deploys on Kubernetes, cloud, or HPC, and Dask libraries make it easy to use as much or as little compute as you need

ziyuan.z · October 12, 2022, 4:43am

Hi Andrew, the word “scale” appears many times in this Dask website. What does it mean?

andrew.manning · October 12, 2022, 1:51pm

They are describing the ability of the software to run on computing resources across a range of sizes without changing the source code or deployment configuration. This is also what we want for the CE. At the smallest scale, it should run on an individual laptop. At a larger scale, it should run in an HPC environment. In a larger scale than that, it should run on a system like Kubernetes that can dynamically run it across multiple “compute nodes” in parallel.

ziyuan.z · October 12, 2022, 3:53pm

Got it, thanks!!

andrew.manning · October 12, 2022, 7:19pm

At the Software Directorate all hands meeting today, someone presented work that involved the WDL: Workflow Description Language, which I had never heard of before but seems relevant to this topic.

The Workflow Description Language (WDL) is a way to specify data processing workflows with a human-readable and -writeable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution. The language makes common patterns simple to express, while also admitting uncommon or complicated behavior; and strives to achieve portability not only across execution platforms, but also different types of users. Whether one is an analyst, a programmer, an operator of a production system, or any other sort of user, WDL should be accessible and understandable.

rhaas · October 12, 2022, 9:41pm

I suspect I may have brought those up before but just to have it in here as well. There are a good number of workflow engines to choose from. Already for Blue Waters there was a whole series of engines that were presented on: Blue Waters User Portal | Scientific Workflows The interesting ones for us would be:

Parsl (co-developed locally by Dan Katz)
swift/T if we need very low latency
Makeflow
the generic overview Blue Waters User Portal | Overview of Scientific Workflows

Pegasus is too complex for our use (and may have too much latency).

andrew.manning · October 26, 2022, 9:15pm

This article looks remarkably relevant to our current specific vision of how the Calculation Engine will operate:

Parsl: Parallel Scripting in Python

The likelihood of someone else writing a “Provider” for Delta for us is a compelling reason to choose Parsl:

There is a simple LocalProvider for shared memory multiprocessors and a provider for Kubernetes in the cloud. In addition, there are special providers for a host of supercomputers including

Argonne’s Theta and Cooley

ORNL’s Summit

Open Science Grid Condor Clusters

University of Chicago Midway Cluster

TACC’s Frontera

NERSC’s Cori

SDSC’s Comet

NCSA’s Blue Waters

NSCC’s Aspire 1