Stack provenance in Tasks?

The HSC stack contains code to record stack provenance in a similar way to what is done with configurations: if the file root/config/eups.versions exists, it is read in and compared with the current setups, and any differences are an error (which can be overridden with --clobber-config); if it does not exist, it is created. This kind of functionality is important for production runs, where one wants to guarantee that the entire production was done with a single version of the stack, and the HSC port would not be complete without something like it.
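
For those who haven’t seen it, the check-or-record logic amounts to something like the following sketch (the function names and the currentSetups dict of package name to version are hypothetical stand-ins; the real HSC code obtains the setups from EUPS):

    import os


    def _write(path, setups):
        with open(path, "w") as f:
            for name, version in sorted(setups.items()):
                f.write("%s %s\n" % (name, version))


    def checkOrRecordVersions(root, currentSetups, clobber=False):
        """Record the current package versions, or compare against an existing record."""
        path = os.path.join(root, "config", "eups.versions")
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            _write(path, currentSetups)  # first run: create the record
            return
        recorded = {}
        with open(path) as f:
            for line in f:
                name, version = line.split()
                recorded[name] = version
        if recorded != currentSetups:
            if not clobber:  # analogue of --clobber-config
                raise RuntimeError("Setups differ from those recorded in %s" % path)
            _write(path, currentSetups)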

While we have a working implementation from HSC, I don’t think we want to port it blindly because:

  1. We don’t want the LSST stack to be dependent upon eups.
  2. We don’t want to be prevented from running something because we setup something unrelated.
  3. The provenance clobbering is linked with the configuration clobbering, whereas the two can be orthogonal.
  4. The provenance checking is subject to the same problem with parent repos that we get with the configuration checking.

The first two points likely drive us to introspecting python modules for their versions, as suggested by @jbosch. Some careful consideration of what command-line switches are necessary can deal with the next point. I don’t know how we deal with the final point.
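
Concretely, the module-introspection approach could look roughly like this (the helper name and the package list are illustrative only):

    import importlib


    def getModuleVersions(moduleNames):
        """Return a dict mapping each module name to its reported __version__ (or None)."""
        versions = {}
        for name in moduleNames:
            try:
                module = importlib.import_module(name)
            except ImportError:
                versions[name] = None
                continue
            versions[name] = getattr(module, "__version__", None)
        return versions


    print(getModuleVersions(["numpy", "matplotlib"]))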

Some questions:

  • Are there objections to adding such a scheme to the CmdLineTask?
  • How would this link in (or not) with other LSST provenance plans?
  • Do we want to have some mechanism that will allow minor version changes in the middle of a big production run (because you don’t want to have to reprocess everything when you discover a bug with an obvious fix)? If so, how do we define “minor version changes”?

An outline of the current provenance plan can be found at:

It is not obvious that we want production-like runs of the LSST stack to be independent of eups.

Given that setup of a package can have indeterminate effects on the rest of the stack, this is also not obvious for production-like runs.

I’d rather not go ahead with adding something like this right now if it can be avoided, particularly if it is more complicated in implementation than the HSC functionality. I am worried that an increasing amount of code is being written that is de facto making decisions that change our baseline. In this case, provenance recording was assigned to the orchestration layer, not the pipeline construction toolkit.

Tim has linked to the current provenance storage plans, but that has nothing to do with provenance capture.

In DRP, I think we generally would want to add “afterburner” pipelines rather than modifying a given package on the fly, since the latter could lead to inconsistent processing of data across the data release. But we do need to allow for possible version changes for code and especially for configuration items. Jacek’s provenance schema does allow for this, but again the processes around how to actually modify the production and how to capture the modified versions are not yet fully defined.

For what it’s worth, my reading is that the baseline is pretty woolly on this point: LDM-152 §7.1.2.2 on the pipeline construction toolkit specifically refers to command line tasks being “directed to capture their runtime environment, including the versions of software packages in use” as an alternative to relying on the orchestration layer.

However, that’s quibbling (sorry): hopefully, any ambiguities about where this work falls will be resolved by the ongoing planning process within a few months.

Until we get there, though, what’s the best way to proceed? Could we consider, for example, porting the current HSC code as-is to enable them to make progress over the next several months, with the understanding that it will ultimately be replaced by LSST?

My $0.02 on this is that we should not port the HSC stuff as-is, but reimplement just the version-getting part now with the following principles:

  • Use our own version.py files to read the versions from our own software instead of EUPS.
  • Enumerate a set of third-party packages we also care about versioning, which may not be complete, but is sufficient to capture anything we expect to change pipeline outputs. (It’s not my call whether that’s sufficient for LSST; I think it is for HSC). Write version-extraction functions for all of these (e.g. import python modules and look at __version__, compile some C++ code to inspect preprocessor macros, etc).
  • Attach these to the pipeline in roughly the same manner we’ve attached them in HSC - as butler datasets stored by CmdLineTasks in much the same way as the configuration and schemas (see the sketch after this list).
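
A rough sketch of how the last two bullets might fit together, assuming a Gen2-style Butler and a hypothetical “packageVersions” dataset type (none of the names below are existing stack code, and the --clobber-versions behaviour is only a suggestion):

    def _numpyVersion():
        import numpy
        return numpy.__version__


    # Package name -> callable returning a version string.  Our own packages
    # would use their generated version.py files; C++-only dependencies could
    # register a getter backed by a small compiled helper that reports the
    # relevant preprocessor macros.
    VERSION_GETTERS = {
        "numpy": _numpyVersion,
    }


    def getPackageVersions():
        versions = {}
        for name, getter in VERSION_GETTERS.items():
            try:
                versions[name] = getter()
            except Exception:
                versions[name] = "unknown"
        return versions


    def writeVersions(butler, clobber=False):
        """Persist the versions as a butler dataset, much like the task config."""
        versions = getPackageVersions()
        if butler.datasetExists("packageVersions"):
            old = butler.get("packageVersions")
            if old != versions and not clobber:
                raise RuntimeError("Package versions have changed; rerun with --clobber-versions")
        butler.put(versions, "packageVersions")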

That doesn’t address all of @price’s concerns about the state of the HSC stuff, but I think it handles the more serious problems.

And it doesn’t address @ktl’s concern about whether this should be in orchestration instead, but I don’t think anything here precludes us from moving it there later. In particular, the version-getter functions could probably continue to be used after such a refactoring, and that’s the only part that’s possibly more complex than the current HSC functionality.

I am strongly in favor of the runtime reporting its own versions rather than obtaining it from the environment.

Although I imagine the real problem is what to do with the information. The provenance handling system doesn’t want 50 individual version numbers. This is where EUPS wins, of course, since it provides a single build number. Without that you have to use git tags, and then you aren’t entirely sure what really ran, only what you think was installed.
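
One possible way (purely illustrative, not something anyone has proposed here) to get back to a single identifier would be to hash the full set of recorded versions; two runs then share a digest only if every version matches, although this still records what you think was installed rather than what actually ran:

    import hashlib


    def versionsDigest(versions):
        """Collapse a dict of package -> version into one short identifier."""
        text = "\n".join("%s=%s" % (k, v) for k, v in sorted(versions.items()))
        return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


    print(versionsDigest({"afw": "12.0", "numpy": "1.11.2"}))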

IMHO, all stack packages which are set up should be captured, because we presumably want to be able to replicate results later, for debugging purposes and otherwise.

We’d previously been capturing all package information (even to the extent of where it was on disk), including the orchestration itself and its dependent packages, and the policy files used for the run.

How does an afterburner deal with critical bug fixes?

It depends on what type of bug it is. If it’s an incorrect calculation that can be fixed up afterwards, an afterburner could be sufficient. If it’s something that is infinite-looping or crashing, then a different kind of fix would be needed (which might also involve the first kind – e.g. if we’re going to get stuck, just output a NaN and then later do a different, more complicated calculation possibly involving different inputs). I would expect that testing can reduce the type 2 bugs more than the type 1.

My guess is that both types will require reprocessing as they require pixel access – if you’d rather do that as an afterburner it might be doable (depending on how hard it is to do a hot restart).

I also expect that there will be more SEGV-level problems than you expect, as nature is good at finding edge cases in code! But that’s based on my SDSS experience, which was not, umhh, well unit tested.

Asking for flexibility in code and configuration when we’ve already done significant processing of many patches of sky is not going to be easy while also maintaining consistency across the whole Data Release and having reasonable provenance. I think we can provide some capability in this area, but I’d much prefer to avoid having to use it by: 1) testing the code, 2) running a mini-production, 3) forcing a restart of the entire DRP (or at least going back to the last global sequence point if we have any), and/or 4) waiting to fix it until the next DR.

As you well know, the “standard” workflow is to do extensive testing, then branch to create a production version. Any critical fixes get made on this branch. While you’d like to avoid this, I’m afraid that we should plan to support it although I fully understand that this makes provenance harder.

I am not worried about the consistency issue. Fixes only get merged to the production branch upon agreement of [choose your body], and with the expectation that users need not care – although, of course, we need to be able to report exactly what we did.

That’s not what I was referring to, nor what I think Paul was referring to by “minor version changes in the middle of a big production run”. The question is not how to track versions of the code that come from different branches, the question is how to track datasets (or, worse, individual data items) from the same (claimed) production run that were generated by different code because of modifications to that code while the run was in progress. It’s easier to say that all Objects were processed with afterburner X version A.B.C than to say that 1% of the columns of 10% of the Objects were processed using measurement code Y version D.E.G while the rest were processed using Y version D.E.F.

I think it’s exactly what @price meant (but he should step in). You process some data, find a critical bug, fix it, and proceed with the rest of the data using the new version.

@RHL’s interpretation is closest to what I was thinking, but I believe @ktl’s interpretation is a corollary.

This has already overflowed one (short) sprint. I would like to move forward with one of the following options:

  1. Port the HSC provenance code as-is.
  2. Port the HSC provenance code with some changes (e.g., following @jbosch’s suggestion).
  3. Agree to defer worrying about provenance.

I think @ktl is the only one standing in the way of adopting option #2. Is that true?

Having heard no objections, I will proceed with an RFC for option 2.