Tying Software Provenance into the QA System

I’m working with @afausti and the rest of @SQuaRE on the SQuaSH QA system. A compelling use-case for QA is to provide developers with feedback on how their commits affect science quality metrics in much the same way as the CI system validates the code itself.

To make this great, we’ll want to tie each QA run (e.g., validate_drp running on the output of processCCD) to the exact version composition of the Stack. This way we could compare metric time series for ‘master’ against ticket branch development. Going further, we’ll want to tie QA runs to individual commits on a ticket branch so that a developer can understand exactly how their code changes influence science metrics.

Rather than capturing provenance ourselves, we should certainly adopt LSST’s provenance database. @jbecla has posted the provenance design at https://github.com/lsst-dm/provenance_proto/blob/tickets/DM-3962/Provenance.md, but I don’t fully understand how software provenance works here.

My questions/needs:

  1. How is Stack software provenance stored? We’d like to know:
     • Package names
     • GitHub URLs for the authoritative package repositories
     • Git commit of each software repository used in the stack
     • The branch name that commit resolves to (although that’s easy to compute)
  2. Is the provenance system such that every QA run will be registered in the provenance DB?
  3. We’ll need an identifier to tie the QA run (a Jenkins job number) to records in the QA DB.
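
To make the list above concrete, here is a minimal sketch of the record a QA run might carry. It is not the schema from the provenance design; the class names, field names, and the `changed_packages` helper are all hypothetical, and the commit strings are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class PackageProvenance:
    """One Stack package as used in a QA run (hypothetical layout)."""
    name: str        # package name
    repo_url: str    # authoritative GitHub repository
    commit: str      # Git commit that was built
    branch: str = ""  # branch the commit resolves to, if known


@dataclass
class QARunProvenance:
    """Software provenance for one QA run, keyed by the CI job number."""
    ci_job_id: int
    packages: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the list of PackageProvenance records.
        return json.dumps(asdict(self), sort_keys=True)


def changed_packages(old: QARunProvenance, new: QARunProvenance) -> list:
    """Return the names of packages whose commit differs between two runs."""
    old_commits = {p.name: p.commit for p in old.packages}
    new_commits = {p.name: p.commit for p in new.packages}
    names = old_commits.keys() | new_commits.keys()
    return sorted(n for n in names
                  if old_commits.get(n) != new_commits.get(n))
```

Comparing a ‘master’ run against a ticket-branch run then reduces to calling `changed_packages` on the two records, which is the kind of query the dashboard would need in order to attribute a metric change to a specific package.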

For completeness, these discussions are related:

See also RFC-169 which talks about an interim solution from HSC.

It’s probably also worth following DM-3372, where @price is in the process of bringing over the HSC persistence while replacing its ugliest parts.

The way I think things will work once DM-3372 is complete is that the list of packages and versions will be saved as a Butler data product (just a text file of some kind) within the output repository. On the HSC side, those names and versions are generated from the list of setup EUPS products. The hope is to change that to look instead at the version.py modules generated by sconsUtils (these encode the Git version of a package as of the last time SCons was run), along with some custom version-inspection code for important third-party packages.
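
As a rough illustration of the version.py idea, the sketch below imports `<package>.version` for each named Python package and records its `__version__` attribute. This is not the DM-3372 implementation: the attribute name is an assumption about what sconsUtils writes, and the function names and the one-line-per-package text format are hypothetical.

```python
import importlib


def collect_versions(package_names):
    """Map each package name to the version recorded in its version module.

    Assumes each package exposes a `version` submodule with a
    `__version__` attribute; anything else is reported as "unknown".
    """
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name + ".version")
        except ImportError:
            # Package is missing or has no generated version module.
            versions[name] = "unknown"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


def format_versions(versions):
    """Render the mapping as sorted 'name version' lines (hypothetical format)."""
    return "".join(f"{name} {version}\n"
                   for name, version in sorted(versions.items()))
```

Third-party packages without a generated version module would fall into the "unknown" branch here, which is where the custom version-inspection code mentioned above would have to take over.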

Right now, this is completely separate from the provenance proposal @jbecla put together, and it’s not clear to me yet how to connect that rather high-level, long-term vision with the stack we have today.

So none of the HSC provenance work includes a stand-alone provenance DB with a RESTful API, is that right? I think our hope was to avoid putting extra tables and logic into our QA dashboard application.

We can probably ship the QA dashboard with a slimmed-down notion of software versioning and then, when the provenance DB is available, switch to that and gain a lot of functionality. So I guess the question is whether the final provenance solution is compatible with the use-case I described.

Correct. All of the I/O goes through the Butler, though, so if there’s some RESTful API being planned for the Butler (which wouldn’t surprise me at all), that may be how to make these solutions talk to each other long-term.

In the longer term, the pipeline harness (which I’ll use instead of CmdLineTask or SuperTask or whatever is wrapping the science algorithms) should likely write directly to the provenance database. This can be intermediated by the Butler (once the Butler’s database-writing ability is restored). Using the Butler (whether RESTful or not) to access file-based provenance is probably not ideal from the QA perspective, as it will not provide the query capabilities that the database would.

In the shorter term, adding a loading step to ingest file-based provenance into the provenance database might not be that difficult.
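
A loading step of that kind could be as small as the sketch below. It uses SQLite purely as a stand-in for the provenance database, and it assumes a file of whitespace-separated "name version" lines; the table name, column names, and `run_id` key are hypothetical and do not come from the provenance design.

```python
import sqlite3


def parse_versions(text):
    """Parse 'name version' lines, skipping blank lines and # comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed provenance line: {line!r}")
        records.append((parts[0], parts[1]))
    return records


def ingest(conn, run_id, text):
    """Load one run's package versions; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package_version ("
        " run_id TEXT NOT NULL,"
        " package TEXT NOT NULL,"
        " version TEXT NOT NULL,"
        " PRIMARY KEY (run_id, package))"
    )
    records = parse_versions(text)
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO package_version VALUES (?, ?, ?)",
            [(run_id, name, version) for name, version in records],
        )
    return len(records)
```

Once the rows are in a database, the QA side can ask questions such as "which runs used this version of a package" with a plain SQL query, which is the capability that file-based provenance alone does not offer.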