The HSC stack contains code to record the stack provenance, in much the same way as is done with configurations: if the file root/config/eups.versions exists, it is read in and compared with the current setups, and any difference is an error (which can be overridden with --clobber-config); if it does not exist, it is created. This kind of functionality is important for production runs, where one wants to guarantee that the entire production was done with a single version of the stack, and the HSC port would not be complete without some functionality like this.
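For concreteness, here is a minimal sketch of that check; it is not the actual HSC code, and it assumes a caller-supplied getCurrentSetups() function returning a {product: version} dict and a simple "product version" per-line format for eups.versions.

```python
import os

def checkStackProvenance(root, getCurrentSetups, clobber=False):
    """Compare recorded package versions against the current setups.

    Any difference is an error unless ``clobber`` is set (cf. --clobber-config);
    if the file does not exist, it is created from the current setups.
    """
    path = os.path.join(root, "config", "eups.versions")
    current = getCurrentSetups()  # e.g. {"afw": "8.1.0", "daf_persistence": "8.1.0"}
    if os.path.exists(path):
        with open(path) as fd:
            recorded = dict(line.split() for line in fd if line.strip())
        diffs = {name: (recorded.get(name), version)
                 for name, version in current.items()
                 if recorded.get(name) != version}
        if diffs and not clobber:
            raise RuntimeError("Stack version mismatch: %s" % (diffs,))
    else:
        with open(path, "w") as fd:
            for name, version in sorted(current.items()):
                fd.write("%s %s\n" % (name, version))
```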
While we have a working implementation from HSC, I don’t think we want to blindly port it from HSC because:
We don’t want the LSST stack to be dependent upon eups.
We don’t want to be prevented from running something because we set up something unrelated.
The provenance clobbering is tied to the configuration clobbering, even though the two could be orthogonal.
The provenance checking is subject to the same problem with parent repos that we get with the configuration checking.
The first two points likely drive us to introspecting python modules for their versions, as suggested by @jbosch. Some careful consideration of what command-line switches are necessary can deal with the next point. I don’t know how we deal with the final point.
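As a rough illustration of that introspection (my sketch, not existing stack code): import each module and read its __version__ attribute if it has one.

```python
import importlib

def getModuleVersion(name):
    """Return the version reported by an importable module, or None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

# Results depend on what is installed, e.g. {"numpy": "1.17.0", "lsst.afw": None}
versions = {name: getModuleVersion(name) for name in ("numpy", "lsst.afw")}
```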
Some questions:
Are there objections to adding such a scheme to the CmdLineTask?
How would this link in (or not) with other LSST provenance plans?
Do we want to have some mechanism that will allow minor version changes in the middle of a big production run (because you don’t want to have to reprocess everything when you discover a bug with an obvious fix)? If so, how do we define “minor version changes”?
It is not obvious that we want production-like runs of the LSST stack to be independent of eups.
Given that setting up a package can have indeterminate effects on the rest of the stack, this is also not obvious for production-like runs.
I’d rather not go ahead with adding something like this right now if it can be avoided, particularly if it is more complicated in implementation than the HSC functionality. I am worried that an increasing amount of code is being written that is de facto making decisions that change our baseline. In this case, provenance recording was assigned to the orchestration layer, not the pipeline construction toolkit.
Tim has linked to the current provenance storage plans, but those have nothing to do with provenance capture.
In DRP, I think we would generally want to add “afterburner” pipelines rather than modify a given package on the fly, which could lead to inconsistent processing of data across the data release. But we do need to allow for possible version changes for code and especially for configuration items. Jacek’s provenance schema does allow for this, but again the processes for how to actually modify the production and how to capture the modified versions are not yet fully defined.
For what it’s worth, my reading is that the baseline is pretty woolly on this point: LDM-152 §7.1.2.2 on the pipeline construction toolkit specifically refers to command line tasks being “directed to capture their runtime environment, including the versions of software packages in use” as an alternative to relying on the orchestration layer.
However, that’s quibbling (sorry): hopefully, any ambiguities about where this work falls will be resolved by the ongoing planning process within a few months.
Until we get there, though, what’s the best way to proceed? Could we consider, for example, porting the current HSC code as-is to enable them to make progress over the next several months, with the understanding that it will ultimately be replaced by LSST?
My $0.02 on this is that we should not port the HSC stuff as-is, but reimplement just the version-getting part now with the following principles:
Use our own version.py files to read the versions from our own software instead of EUPS.
Enumerate a set of third-party packages we also care about versioning; the set may not be complete, but it is sufficient to capture anything we expect to change pipeline outputs. (It’s not my call whether that’s sufficient for LSST; I think it is for HSC.) Write version-extraction functions for all of these (e.g. import python modules and look at __version__, compile some C++ code to inspect preprocessor macros, etc.); see the sketch after this list.
Attach these to the pipeline in roughly the same manner we’ve attached them in HSC - as butler datasets stored by CmdLineTasks in much the same way as the configuration and schemas.
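A sketch of how these points might fit together, under assumptions: the package list is purely illustrative, our own packages are assumed to expose __version__ via their version.py, and “packages” is a hypothetical dataset type written through a Gen2-style butler.put, mirroring how the config is persisted.

```python
import importlib

# Illustrative package list: our own packages (assumed to expose __version__
# via their version.py) plus enumerated third-party packages we expect to
# affect pipeline outputs.
TRACKED_PACKAGES = ("lsst.afw", "lsst.meas.algorithms", "numpy", "scipy", "astropy")

def getPackageVersions(names=TRACKED_PACKAGES):
    """Return a dict mapping package name to its reported version (or None)."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", None)
        except ImportError:
            versions[name] = None
    return versions

# Inside a CmdLineTask, the result could then be persisted alongside the
# config and schemas, e.g. (hypothetical "packages" dataset type):
#     butler.put(getPackageVersions(), "packages")
```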
That doesn’t address all of @price’s concerns about the state of the HSC stuff, but I think it handles the more serious problems.
And it doesn’t address @ktl’s concern about whether this should be in orchestration instead, but I don’t think anything here precludes us from moving it there later. In particular, the version-getter functions could probably continue to be used after such a refactoring, and that’s the only part that’s possibly more complex than the current HSC functionality.
I am strongly in favor of the runtime reporting its own versions rather than obtaining them from the environment.
Although I imagine the real problem is what to do with the information: the provenance handling system doesn’t want 50 individual version numbers. This is where EUPS wins, of course, since it provides a single build number. Without that you have to use git tags, and then you aren’t entirely sure what really ran, only what you think was installed.
IMHO, all stack packages which are set up should be captured, because we presumably want to be able to replicate results later, for debugging purposes and otherwise.
We’d previously been capturing all package information (even to the extent of where it was on disk), including the orchestration itself and its dependent packages, and the policy files used for the run.
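For reference, a sketch of that kind of capture via the eups Python API; I am assuming Eups.getSetupProducts() is available as in the LSST eups fork, so treat the exact calls as illustrative rather than authoritative.

```python
import eups

def getSetupPackages():
    """Record the name, version, and on-disk location of every set-up product."""
    env = eups.Eups()
    return {product.name: {"version": product.version, "dir": product.dir}
            for product in env.getSetupProducts()}
```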
It depends on what type of bug it is. If it’s an incorrect calculation that can be fixed up afterwards, an afterburner could be sufficient. If it’s something that is infinite-looping or crashing, then a different kind of fix would be needed (which might also involve the first kind – e.g. if we’re going to get stuck, just output a NaN and then later do a different, more complicated calculation, possibly involving different inputs). I would expect that testing can reduce the second type of bug more than the first.
My guess is that both types will require reprocessing, as they require pixel access – if you’d rather do that as an afterburner it might be doable (depending on how hard it is to do a hot restart).
I also expect that there will be more SEGV-level problems than you expect, as nature is good at finding edge cases in code! But that’s based on my SDSS experience, which was not, umhh, well unit tested.
Asking for flexibility in code and configuration when we’ve already done significant processing of many patches of sky is not going to be easy while also maintaining consistency across the whole Data Release and having reasonable provenance. I think we can provide some capability in this area, but I’d much prefer to avoid having to use it by: 1) testing the code, 2) running a mini-production, 3) forcing a restart of the entire DRP (or at least going back to the last global sequence point if we have any), and/or 4) waiting to fix it until the next DR.
As you well know, the “standard” workflow is to do extensive testing, then branch to create a production version. Any critical fixes get made on this branch. While you’d like to avoid this, I’m afraid that we should plan to support it although I fully understand that this makes provenance harder.
I am not worried about the consistency issue. Fixes only get merged to the production branch upon agreement of [choose your body], and with the expectation that users need not care – although, of course, we need to be able to report exactly what we did.
That’s not what I was referring to, nor what I think Paul was referring to by “minor version changes in the middle of a big production run”. The question is not how to track versions of the code that come from different branches, the question is how to track datasets (or, worse, individual data items) from the same (claimed) production run that were generated by different code because of modifications to that code while the run was in progress. It’s easier to say that all Objects were processed with afterburner X version A.B.C than to say that 1% of the columns of 10% of the Objects were processed using measurement code Y version D.E.G while the rest were processed using Y version D.E.F.
I think it’s exactly what @price meant (but he should step in). You process some data, find a critical bug, fix it, and proceed with the rest of the data using the new version.