Job-level provenance

Proposed date: JTM
Connection: In person
Suggested audience: @gpdf @fritzm @ktl NCSA? who else cares?


I’d like to start a discussion about provenance. In particular I am interested in:

  • Describing SQuaRE’s requirements for job-level provenance (explanation below)

  • Establishing whether DM has already planned this type of capability (e.g. through the SLAC provenance work package)

  • If not, discussing potential solutions/ways forward.

In our verification workflows (e.g. those implemented by SQuaSH) we calculate metrics (not just KPMs, but arbitrary characterisations of performance). However, in consuming those metrics (e.g. in creating regression plots, excursion alerts, etc.) we group them by certain common, meaningful characteristics. For example, was this AM1 metric calculated with the testdata_hsc dataset or the testdata_cfht dataset? Was it r band only, or all bands? Ultimately we want to compare like with like, so we need a definition of “like”.

In our context, establishing “like” is done with what we are calling the “provenance” of the metric. In the supertask/activator paradigm, for those familiar with it, the “provenance” generally covers things the activator knows, as opposed to things the supertask knows: what data (butler repo, whatever) was I run on? On what OS? What configuration was passed in? Etc.

Typically in astronomy we think of provenance as the ancestry of, say, a data file (what individual exposures contributed to this mosaic? What was the version of the mosaic software that produced it?). But in our case, the provenance of the metric is really the provenance of the “job” (whether a Jenkins job now or a production workflow-system job ultimately). Not only can each job potentially generate metrics with different provenance, but even in cases where the provenance is the same, we track metrics versus time by job (this is the essence of a regression plot: here’s the number from this job, here’s the number from yesterday’s job, here’s the number from the day before yesterday’s job, and so on).

Moreover, unlike situations where provenance information is largely forensic (and so does not necessarily justify tooling to access it), for this job-related provenance we need an API to interact with it as part of the normal operation of the verification system. Hey, I just calculated this metric; do I group it with the “AM1 from cfht data” bucket, or the “AM1 from hsc data” bucket? Oh, its provenance is testdata_cfht, so “AM1 from cfht data” it is.
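To make the bucketing idea concrete, here is a minimal sketch of grouping metric measurements by a provenance-derived key. Everything here (the key fields, the dictionary shape, `bucket_key`) is illustrative, not an existing SQuaSH or DM API:

```python
# Hypothetical sketch: grouping metric measurements into "like-with-like"
# buckets using a key derived from job provenance. Field names are invented.
from collections import defaultdict

def bucket_key(provenance):
    """Derive a grouping key from the pieces of provenance we care about."""
    return (provenance["metric"], provenance["dataset"], provenance["filters"])

measurements = [
    {"value": 5.2, "provenance": {"metric": "AM1", "dataset": "testdata_cfht", "filters": "r"}},
    {"value": 4.9, "provenance": {"metric": "AM1", "dataset": "testdata_hsc", "filters": "all"}},
    {"value": 5.0, "provenance": {"metric": "AM1", "dataset": "testdata_cfht", "filters": "r"}},
]

buckets = defaultdict(list)
for m in measurements:
    buckets[bucket_key(m["provenance"])].append(m["value"])

# The two testdata_cfht r-band runs land in the same regression timeline;
# the testdata_hsc run goes into a separate bucket.
print(buckets[("AM1", "testdata_cfht", "r")])  # [5.2, 5.0]
```

The point is only that the verification system needs to ask this question programmatically at ingest time, which is why an API (rather than forensic-only storage) matters.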

As with other aspects of SQuaSH, we initially assumed that we would be writing a cheap shim and throwing it away later for a planned production system. Today @fritzm and I had a very useful discussion (thanks!) in which we reviewed some of the current plan for the provenance capability planned for development at SLAC (e.g. see Jacek’s provenance prototype). It is relatively clear that if Fritz were to act on that design literally (which, I hasten to add, he has not necessarily committed to), it would not fulfil my job-based provenance use cases; as it stands it is very catalogue-row oriented. Also, no general access (e.g. dax_) had been envisaged.

What I’d like to understand better:

  • My immediate need here (since we are shimming what we need anyway to avoid being blocked) is to understand whether I’m developing a throwaway temporary system or whether I [royal I] should be thinking about a production-grade system because no other work package will provide the functionality I need.

  • My own feeling is that the planned provenance system has enough information in it that a second, additional job-based provenance system would involve too much duplication; so perhaps a way forward is for Fritz to incorporate some of my provenance-representation needs, and I could provide the dax_-type API to it? I’m interested in Architecture’s guidance. I am obviously concerned about having to do unplanned/unfunded work, so overall I’d rather we found a way to address this through an already-planned package.

  • Timing is also an issue: per Fritz, the provenance work comes very late in Construction, whereas the primary delivery date for verification/QC/etc. tooling is ComCam/commissioning. I’d potentially be going into my “operations” ahead of that.

  • From experience, I would not be surprised if NCSA’s workflow work (whose details I am ignorant of) results in job-level provenance requirements too, so I’m keen to avoid duplication there. Are there any other stakeholders lurking?

I welcome comments here, and perhaps interested parties can also meet at JTM if there is sufficient interest. If the answer is “you and Fritz sort it out, we don’t care” that’s fine too, but I would be surprised :slight_smile:


I should add that I wonder if some of my use case is common with an “L3”-type user, i.e. somebody who wants to filter results on the basis that they came from one of their jobs.

I’m concerned that any provenance system needs to support interactive “QA” type tasks, as performed by the Pipelines groups during development and by the Science Verification group. However, before we can discuss how their needs impact on the design of the provenance system, we need to better understand what the overall QA plan is. (Even as I write that, I’m worried that I’m getting the terminology “wrong”, which I think just serves to illustrate how immature our thinking on QA is.)

In other words: we should regard Pipelines and Science Verification as important stakeholders for provenance, but this discussion may be too early for them to have a proper understanding of their requirements.

My first reaction is that Science Pipelines will want a lot of what you want too, and if that’s not in DAX’s plans for provenance, it’s probably an indication we didn’t engage with them enough - or that we were perhaps thinking of the job-level provenance and database provenance as being more distinct than they are, and that job-level provenance wasn’t really being designed yet.

But my second reaction is that I’m not convinced job-level provenance actually is anything more than just connecting jobs to the output data repositories they produce. We already work fairly hard to prevent an individual output data repository from including data products with different provenance, and I expect that to become formalized further in SuperTask. So any job (CI or human-launched) can easily have a one-to-one relationship to a data repository with well-defined provenance (since it can always choose to create a new output data repository), and at worst multiple jobs will be associated with a single output repository because they have the same provenance. I’m a little worried that this scheme doesn’t leave room for things that can unexpectedly change between different jobs with the same provenance (e.g. we define “provenance” in a way that doesn’t include the version of some OS library that we don’t expect to matter, but it does), but I don’t know if that’s a big enough concern to merit extra work to make it easy to debug such problems.
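Under the assumption above (each job maps to exactly one output repository, and each repository carries a single well-defined provenance record), job-level provenance reduces to a join. A toy sketch, with entirely invented identifiers and fields, just to show the shape of the argument:

```python
# Assumed data model, not existing Butler/SuperTask code: jobs map many-to-one
# onto output repositories, and provenance attaches to the repository.
jobs = {
    "jenkins-101": "repo/a",
    "jenkins-102": "repo/b",
    "jenkins-103": "repo/a",  # reruns with identical provenance share a repo
}
repo_provenance = {
    "repo/a": {"dataset": "testdata_cfht", "stack": "w_2017_10"},
    "repo/b": {"dataset": "testdata_hsc", "stack": "w_2017_10"},
}

def job_provenance(job_id):
    """Job-level provenance as a lookup through the job's output repository."""
    return repo_provenance[jobs[job_id]]

# Jobs sharing an output repository share provenance by construction.
assert job_provenance("jenkins-101") == job_provenance("jenkins-103")
```

The worry raised above is precisely what this model cannot express: two jobs writing to the same repository would be indistinguishable even if some untracked environment detail (e.g. an OS library version) differed between them.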

I’m also not sure how the data repository provenance system we’ve been using for pipeline jobs (and planning, at least vaguely, to enhance with SuperTask) relates to the provenance prototype Jacek put together. I had been assuming that storing provenance at the level of an output repository was a requirement since it was something we already did (and hence dropping it would be a regression), but I could also imagine it being considered an implementation detail that could be superseded by a finer-grained system (which in this context would be much harder to relate to jobs than repository-level information).

> We already work fairly hard to prevent an individual output data repository from including data products with different provenance, and I expect that to become formalized further in SuperTask.

The problem is that “I” (semantic shortcut for the verification use case) don’t care about the output data repository; I care about the input, and not in the sense of it being identical, but in the sense of it being identifiable as a member of a certain class. For example, when you get your AM1 regression timeline in SQuaSH after a run on testdata_cfht, even if you check some new files into testdata_cfht, you still want to identify that next AM1 measurement as part of that timeline; you don’t want to drop your whole history and go “something has changed!” [though of course you do want to know that something changed, and our annotations indicate that]. It really is a property of the job and not the data/products/repo.

I am pretty sure based on just the replies so far that we ought to talk about this in person, I’ll see if I can identify a slot at the JTM.

As far as the job-level stuff goes -

When we were using policy files, orchestration used to save the policy driver file itself and all the software versions of the stack that was being executed. The original code, which I believe Jacek wrote back then, got bit-rotted when we switched away from using policy files. (I can’t remember if it was in ctrl_provenance or another package… it’s been a while.)

We’re planning on saving that software package information, along with information about when jobs were submitted by the workflow driver, when they started executing, where they started executing, when they stopped, memory footprint, and some more things. This will all be saved to a set of database tables by orchestration.
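For concreteness, the kind of tables described above might look something like the following. This is a guess at the shape of the schema based purely on the description (submission/start/stop times, execution host, memory footprint, software versions); the actual orchestration tables may differ in every detail:

```python
# Illustrative sketch of orchestration-style provenance tables, using an
# in-memory SQLite database. Table and column names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE job (
    job_id INTEGER PRIMARY KEY,
    submitted_at TEXT,
    started_at TEXT,
    finished_at TEXT,
    exec_host TEXT,
    peak_memory_kb INTEGER
);
CREATE TABLE job_package (
    job_id INTEGER REFERENCES job(job_id),
    package TEXT,
    version TEXT
);
""")

con.execute(
    "INSERT INTO job VALUES (1, '2017-02-01T10:00', '2017-02-01T10:05', "
    "'2017-02-01T11:00', 'lsst-dev01', 2048000)"
)
con.execute("INSERT INTO job_package VALUES (1, 'afw', '13.0')")

# Recover the software versions for a given job.
row = con.execute(
    "SELECT package, version FROM job_package WHERE job_id = 1"
).fetchone()
print(row)  # ('afw', '13.0')
```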

We should definitely discuss more about what everyone’s provenance needs are and how we can meet them without duplicating work. There are provenance requirements for the Batch Processing System and Data Backbone. I would be interested in participating in any discussions/meetings.

Provenance inside the Data Backbone and Batch Processing System will not be repository based. The Batch Processing System will be making new repositories for every computing job. The current plan was to use something similar to the Open Provenance Model that tracks input and output files with “processes”. Each “process” could be part of a larger process (for example a single compute job or a single execution of a pipeline). Information would also be stored for each “process” (e.g., start and end times, execution host, arguments/parameters, exit status, etc). Files also have metadata stored. The plan was to store lots of data during production which could be reduced for the data release.
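The Open Provenance Model structure described above (processes consuming and producing files, nesting inside larger processes, each carrying execution metadata) can be sketched roughly as follows. All class and field names here are illustrative, not the Data Backbone design:

```python
# Minimal OPM-flavoured sketch: "processes" link input files to output files,
# nest inside larger processes, and carry execution metadata.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Process:
    name: str
    parent: Optional["Process"] = None          # e.g. a compute job within a pipeline run
    inputs: List[str] = field(default_factory=list)   # input file identifiers
    outputs: List[str] = field(default_factory=list)  # output file identifiers
    info: dict = field(default_factory=dict)    # start/end times, host, args, exit status

pipeline = Process("processCcd-run-42", info={"host": "worker07", "exit": 0})
job = Process(
    "ccd-visit-1234",
    parent=pipeline,
    inputs=["raw-1234.fits"],
    outputs=["calexp-1234.fits"],
)

# Walk up the nesting to recover the larger process a job belongs to.
assert job.parent.name == "processCcd-run-42"
```

Tracking at this granularity would naturally support the "store lots of data during production, reduce for the data release" plan, since per-process records can be aggregated upward along the parent links.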

Sorry if I’ve missed it some place else. Is there a proposed time for meeting at JTM?