Provenance recording

andrew.manning · January 13, 2022, 4:17pm

The MUSES proposal explicitly mentions the need for a system to record the provenance of data and results. There are many motivations for such a system, and we are not the first people to tackle the problem.

Provenance: This is the main bookkeeping component of the CI, whose purpose is to record all information that may be useful for service usage analytics, system diagnostics, auditing, and reproducibility of scientific results. This information may include user activity, workflows executed, models evaluated, inputs and outputs and details of computational jobs, with the goal to ensure that computations can be uniquely identified and reproduced. Provenance of computational results is a complex technical problem within modern science, yet it is a critical part of publishing results for the sake of scientific reproducibility. Access to analytic and diagnostic data will only be accessible internally through the CI.
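To make the "uniquely identified and reproduced" goal in the quoted text a bit more concrete, here is a rough sketch of what a single provenance record could look like. Everything in it is my own assumption for illustration: the `ProvenanceRecord` class, its field names, and the choice of a SHA-256 hash over canonical JSON as the identifier do not come from the MUSES proposal or from any existing system.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of some input or output payload."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one computational job (illustrative fields only)."""
    user: str        # who launched the job
    workflow: str    # name or version of the workflow executed
    models: list     # models evaluated in the workflow
    inputs: dict     # input name -> content digest
    outputs: dict    # output name -> content digest
    job: dict        # job details, e.g. container image, resources

    def record_id(self) -> str:
        """Derive a deterministic identifier from the record contents.

        Serializing with sorted keys means that two records with the same
        contents always produce the same identifier.
        """
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example usage with made-up values.
record = ProvenanceRecord(
    user="example-user",
    workflow="example-workflow-v1",
    models=["model-a", "model-b"],
    inputs={"config.yaml": content_digest(b"temperature: 100\n")},
    outputs={"result.dat": content_digest(b"0.1 0.2 0.3\n")},
    job={"image": "example/image:1.0", "cpus": 4},
)
print(record.record_id())
```

Because the identifier depends only on the record contents, rerunning the same workflow on the same inputs with the same job details gives the same ID, while any change to an input file or model list gives a different one. A real system would also need timestamps, storage, and access control, none of which are shown here.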

NCSA has been instrumental in developing a sophisticated system for provenance recording, with the goal of making scientific results reproducible. They founded the National Data Service, which led to the NDS Workbench used, for example, in the NSF EarthCube program.

The National Data Service (NDS) is an emerging vision for how scientists and researchers across all disciplines can find, reuse, and publish data. It builds on the data archiving and sharing efforts already underway within specific communities and links them together with a common set of tools…

Another project spawned from NDS is Whole Tale, which I think we may be able to leverage for MUSES.

What is Whole Tale?

Whole Tale is an NSF-funded Data Infrastructure Building Block (DIBBS) initiative to build a scalable, open source, web-based, multi-user platform for reproducible research enabling the creation, publication, and execution of tales - executable research objects that capture data, code, and the complete software environment used to produce research findings.

A beta version of the system is available at https://dashboard.wholetale.org.

Why Whole Tale?

Virtually all published discoveries today have data and computational components. There is a mismatch between traditional scientific dissemination practices and modern computational research practice that leads to reproducibility concerns. The Whole Tale platform supports computational reproducibility by enabling researchers to create and package code, data and information about the workflow and computational environment necessary to support review and reproduce results of computational analysis that are reported in published research. Whole Tale implements this definition by supporting explicit citation of externally referenced data, capturing the artifacts and provenance information needed to facilitate understanding, transparency, and execution of the computational processes and workflows used for review and reproducibility at the time of publication.

I recently contacted a member of the Whole Tale development team to start a discussion about this topic.