GeoEDF workflow engine and science gateway

andrew.manning · November 10, 2023, 9:35pm

Overview

The GeoEDF project is very similar to MUSES from a workflow engine perspective. We should learn about their framework and deployment methods to inform our alpha release.

GeoEDF is an extensible data framework designed to simplify data wrangling in geospatial research workflows. GeoEDF enables researchers to define scientific workflows as a logical sequence of data acquisition and processing steps.

The presentation “GeoEDF: A Framework for Designing and Executing Reproducible Geospatial Research Workflows in Science Gateways” includes slides on how the scientific community can contribute “connectors and processors” (in our case, “modules”) by following the framework's standards and opening a pull request to the relevant repos. CI/CD tooling in GitHub builds the images, and the workflow engine (in our case, the “Calculation Engine”) registers the new containers so that workflows can use them.
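To make the registration step concrete for our own discussion, here is a minimal sketch of what a module registry could look like on our side. It is not GeoEDF's implementation: `ModuleRegistry`, the module names, and the `example.org` image references are all hypothetical. The idea is only that each contributed module maps to a container image, and a workflow can be checked against the registry before it runs.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleRegistry:
    """Hypothetical registry mapping a module name to its container image."""

    images: dict[str, str] = field(default_factory=dict)

    def register(self, name: str, image: str) -> None:
        # Refuse silent overwrites so two contributions cannot claim one name.
        if name in self.images:
            raise ValueError(f"module {name!r} is already registered")
        self.images[name] = image

    def missing(self, steps: list[str]) -> list[str]:
        # Return the steps a workflow references that have no registered image.
        return [step for step in steps if step not in self.images]


# Assumed module names and image references, for illustration only.
registry = ModuleRegistry()
registry.register("fetch_data", "example.org/modules/fetch_data:1.0")
registry.register("fit_model", "example.org/modules/fit_model:1.0")

# A workflow that references an unregistered module is flagged before running.
unresolved = registry.missing(["fetch_data", "fit_model", "plot"])
print(unresolved)
```

In a real pipeline the `register` call would presumably be triggered by the CI/CD job after the image build succeeds, but that wiring is left out here.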

Workflows

Their workflow syntax is similar to our concept: a job sequence is defined in a YAML-formatted file. As in the minimal interface we have discussed, their users invoke functions from a published Python API in hosted Jupyter notebooks:

    from geoedfengine.WorkflowEngine import WorkflowEngine
    WorkflowEngine.execute_workflow(<file-path>,'<workflow-name>')

Under the hood, they employ Pegasus to plan and execute jobs on the target execution site (Purdue University’s Halstead cluster). A key difference is that MUSES workflows are neither supplied by users nor dynamically constructed; they are a prix fixe menu, which makes things simpler.
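The prix fixe menu idea can be sketched as a fixed table of named workflows, where users pick a name and supply parameters but never define the job sequence. This is an illustration under assumptions, not our Calculation Engine or GeoEDF's API: `WORKFLOW_MENU`, `run_workflow`, and the job functions are hypothetical names.

```python
from typing import Callable

Job = Callable[[dict], dict]


# Hypothetical jobs; each takes the running state and returns an updated copy.
def load_inputs(state: dict) -> dict:
    return {**state, "inputs": "loaded"}


def run_calculation(state: dict) -> dict:
    return {**state, "result": f"computed from {state['inputs']}"}


# The fixed menu: workflow name -> predefined, ordered job sequence.
WORKFLOW_MENU: dict[str, list[Job]] = {
    "basic": [load_inputs, run_calculation],
}


def run_workflow(name: str, params: dict) -> dict:
    """Run a predefined workflow by name; unknown names are rejected."""
    if name not in WORKFLOW_MENU:
        raise KeyError(f"unknown workflow {name!r}; choose from {sorted(WORKFLOW_MENU)}")
    state = dict(params)  # copy so the caller's dict is not mutated
    for job in WORKFLOW_MENU[name]:
        state = job(state)
    return state


outcome = run_workflow("basic", {"seed": 1})
print(outcome)
```

Because the menu is closed, there is no need to validate arbitrary user-defined job graphs; the only user input to check is the workflow name and its parameters.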

Instead of deploying their workflow engine themselves, as we plan to do, they leverage the MyGeoHub science gateway, which is itself powered by HubZero. The NCSA Delta advanced computing and data resource runs a HubZero instance as well. Delta “partners with The Science Gateways Community Institute (SGCI) in delivering its vision for a modern computing and data resource that is usable and accessible to the broadest possible community of users”.

Deployment

GeoEDF offers a standalone deployment method consisting of a single container running a customized JupyterLab image that includes HTCondor, Pegasus, and their workflow engine. “The standalone deployment is based on the Zero to JupyterHub approach to deploying JupyterHub in a Kubernetes cluster.”

Thus they arrived at an answer to the same question we face, namely which methods to offer researchers for running their calculations: a hosted service and a local Docker-based option.