There appears to be a need for a curated list of data files that people can use in shared workflow components. The specific issue at the moment is the use of EoS tables in this list from the Crust DFT documentation. We need a way for people to efficiently reference the same file for the same purpose without having to upload it again (supported) while also enabling provenance/reproducibility of results (not yet supported).
Here’s an idea for a first iteration of such a shared data file system. A curated list of shared input data files will be uploaded, owned by the admin user so that no one accidentally deletes them. The name, a description, and, most importantly, the checksum of each file (e.g. md5sum) will be registered in a central document, perhaps initially a wiki post on the forum. I can modify the CE so that when you specify an uploaded file as a process input, you must include that file’s checksum. The CE would validate the checksum before executing the process in the workflow and fail if it does not match. The checksum is thus the immutable piece of information required for reproducibility: the file can be re-uploaded (and given a new UUID) as operational needs dictate without making it impossible to verify a previous workflow result.
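To make the proposed check concrete, here is a minimal sketch in Python of validating a declared checksum before a process runs. This is not the CE's actual implementation; the function names `md5_of_file` and `validate_input` are hypothetical, and it assumes the uploaded file is available as a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large EoS tables need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_input(path, expected_md5):
    """Hypothetical pre-execution check: raise if the checksum differs."""
    actual = md5_of_file(path)
    if actual != expected_md5.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: "
            f"expected {expected_md5}, got {actual}"
        )
    return actual
```

Because the check depends only on the file contents, it gives the same answer after a file is re-uploaded under a new UUID, which is the property the proposal relies on.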
Clearly this system will need revising but we need something that works first and then we can worry about what works well. @jakinh I need the scientists to respond with comments and suggestions based on how you plan to use the CE.
By the way, @sroy14, four of these links are dead (as of 2024/10/07 15:27 UTC):
I tried running Crust DFT + Lepton + QLIMR with all the input files that do not contain ‘no lepton’, but all the Crust DFT outputs I’m getting are identical. I don’t think this is expected. Never mind. I ran it again and it miraculously worked.
From my conversations with @sroy14, I think the ‘No Lepton’ tables are not used for MUSES.
Yes, that is correct. The files with leptons contain data both with and without leptons, so for now they are sufficient when used with the inc_lep flag set to false.
So…where can we find this file? Also, this morning we discussed a need for version control for these files. On our side we can calculate the checksum for each file and include that in the workflow spec to ensure that the file is immutable for the sake of reproducibility. But on your side it would be best if you could publish these files to something like Zenodo so there is an immutable, version-controlled, and DOI-citeable copy.
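As a rough sketch of the "calculate the checksum for each file" step, the Python below builds a name-to-checksum manifest for a directory of data files and serializes it as JSON. The function names and the JSON layout are assumptions for illustration only; they are not part of the CE workflow spec or of any Zenodo tooling.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file using the named hashlib algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, algorithm="md5"):
    """Map each regular file name in `directory` to its hex digest."""
    return {
        entry.name: file_digest(entry, algorithm)
        for entry in sorted(Path(directory).iterdir())
        if entry.is_file()
    }


def manifest_json(directory, algorithm="md5"):
    """Serialize the manifest (hypothetical layout) as stable, sorted JSON."""
    payload = {"algorithm": algorithm, "files": build_manifest(directory, algorithm)}
    return json.dumps(payload, indent=2, sort_keys=True)
```

Generating the table from the files themselves, rather than copying values by hand, also avoids checksums ending up next to the wrong file name.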
@sroy14 Did I understand you correctly that these files are no longer necessary for Crust DFT to function because the module is now calculating the values instead of interpolating from these tables?
You may want to update that Zenodo record’s description to include more detail about those specific files, giving some immediate general information about what they are, but also referencing immutable content. As it is, the “Full documentation at https://np3m.org/code/e4mma” is evolving. Consider linking to this page for example, so it is at least pinned at a commit coinciding with the data files.
If you look at the table above, the upload on ce.musesframework.io with that UUID has md5sum 6d6444c2ac7fc88656d41ae846aa3fe4. My admin view of the uploads confirms this.
@mrpelicer There must have been a transcription error. I reconstructed the upload table above by directly copying and pasting text from the CE interface. It looks like the “small_sl, small_r, smaller_r” file checksums had been rotated but should work now.
Pro tip: Use the table expand button to read tables like this more easily: