Data Provenance - mapping processed data to processing nodes

I am building a small prototype to uncover pitfalls related to capturing provenance, and here is the first one worth discussing: the same task will run in parallel on many machines. The thinking so far was that we would track provenance for a group of “things” we process (say, a group of exposures), with each group processed by exactly one node. Then we could just keep the mapping group --> node. Such an approach reduces the volume of provenance we need to track. The issue: we still need to keep the mapping from individual “things” to a group. My current thinking:
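To make the two-level idea concrete, here is a minimal sketch. The names (`exposure_to_group`, `group_to_node`, etc.) are illustrative only, not from any real LSST/DES schema:

```python
# Sketch of the two-level mapping: provenance is tracked per group of
# "things", with a small side table recording which node ran each group.

exposure_to_group = {}   # exposureId -> groupId (the large mapping we must keep)
group_to_node = {}       # groupId -> nodeId (small; one entry per group)

def record(exposure_ids, group_id, node_id):
    """Assign a batch of exposures to a group processed on one node."""
    for eid in exposure_ids:
        exposure_to_group[eid] = group_id
    group_to_node[group_id] = node_id

def node_for(exposure_id):
    """Recover which node processed a given exposure."""
    return group_to_node[exposure_to_group[exposure_id]]

record([101, 102, 103], group_id=7, node_id="node-42")
print(node_for(102))  # -> node-42
```

The per-group table stays tiny; the cost is the exposureId --> groupId table, which is exactly the issue discussed below.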

For exposures:

  • keep the mapping as-is, e.g., a list of exposureIds --> groupId. This is the most robust option. It yields a non-negligible volume of provenance (~35 GB, assuming ~28 million CCD visits in DR1, up to 1/2 billion CCD visits in DR11, and 11 data releases), but that is not too bad!
  • an alternative: determine the mapping programmatically, e.g., use some hashing scheme to map exposureIds to nodeIds. That may yield an uneven distribution (overloading/underloading some nodes), and, even more importantly, complicate things for the processing algorithms, which may want to control which exposures should be processed together and/or where.
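The hashing alternative can be sketched in a few lines; `N_NODES` and the hash choice are hypothetical, and the drawbacks mentioned above are visible in the design: the placement is fixed by the hash, so the algorithms get no say in which exposures land together.

```python
# Sketch of the programmatic alternative: derive nodeId from exposureId
# by hashing, so no mapping table needs to be stored at all.
from collections import Counter

N_NODES = 8  # hypothetical cluster size

def node_for(exposure_id: int) -> int:
    # Any deterministic hash works.  Sequential IDs spread evenly here,
    # but structured/strided ID schemes can pile up on a few nodes.
    return hash(exposure_id) % N_NODES

load = Counter(node_for(eid) for eid in range(10_000))
print(min(load.values()), max(load.values()))  # spread of per-node load
```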

For individual objects/sources/forcedSources, the granularity has to be per-group due to the sheer count of objects/sources. The grouping could be determined based on:

  • the exposure a given object/source came from, or
  • the location on the sky, or
  • something else, but I feel someone with knowledge of the apps algorithms could determine that better than I can…
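The first two criteria above can be sketched as grouping keys. The sky tiling here is a naive fixed-size grid purely for illustration (a real system would more likely use HEALPix or a similar scheme), and all field names are made up:

```python
# Two candidate grouping keys for sources: by originating exposure, or
# by a coarse sky tile.

TILE_DEG = 10.0  # hypothetical tile size in degrees

def group_by_exposure(source):
    return ("exp", source["exposureId"])

def group_by_sky(source):
    # Naive lat/long grid; floor-divide RA/Dec into tile indices.
    return ("sky", int(source["ra"] // TILE_DEG), int(source["dec"] // TILE_DEG))

src = {"sourceId": 1, "exposureId": 101, "ra": 123.4, "dec": -45.6}
print(group_by_exposure(src))  # ('exp', 101)
print(group_by_sky(src))       # ('sky', 12, -5)
```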

Anyway, I feel this needs some thinking/discussion!


I’m so sorry I have had conflicts with the provenance discussions so far, and I’ve only skimmed the documents. I am impressed by how different the discussion here is compared to what has been implemented in DES, and while I am no expert, the apparent divergence of this discussion from my (non-expert) understanding of how provenance is usually approached strikes me.

DES provenance seeks to record a few assertions like “was used by” and “was derived from”. Using these we can trace, say, from a coadd through the images it incorporated, to the configuration files for any program that computed an intermediate, or down to the specific flat exposures that were used to process a single exposure.
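Those assertions amount to edges in a derivation graph, and the tracing is a graph walk. A minimal sketch, with entirely made-up product names (this is not the DES schema):

```python
# "was derived from" assertions as adjacency lists, plus a traversal that
# walks from a coadd all the way down to flats and config files.

derived_from = {
    "coadd-1":  ["calexp-1", "calexp-2"],
    "calexp-1": ["raw-1", "flat-A", "config-X"],
    "calexp-2": ["raw-2", "flat-A", "config-X"],
}

def ancestry(product):
    """All upstream inputs reachable from `product` (depth-first)."""
    seen, stack = set(), [product]
    while stack:
        for parent in derived_from.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestry("coadd-1")))
# -> ['calexp-1', 'calexp-2', 'config-X', 'flat-A', 'raw-1', 'raw-2']
```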

DES provenance is verbose, and I have always thought it could be improved by subsuming literals: in a run we all use the same configurations, for example, and the same flats. (DES did not do this, but DES is on a pretty strict resource budget and reforms its provenance during operations – always a tricky thing.) I think in any case LSST, like DES, has the “easy” case of provenance – many short trees, as opposed to a few very deep trees.

I guess my question is really: how much work have we done looking into the existing body of provenance knowledge? Or, alternatively, do we really mean provenance, or do we mean “information to allow reproduction of the data” – a slightly different question, to my limited understanding…

We are in the process of looking into what has been done; I studied the Open Provenance Model, and @mgelman2 promised to send me some docs about DES provenance during the call we had yesterday.

Yes, indeed, we are putting a lot of focus on capturing “information to allow reproduction of the data”.

When I started at NCSA, Joe Futrelle gave me what I now call “Futrelle’s curse”. DES had stored provenance in multiple special, unrelated tables. Futrelle’s curse was: “well and fine, but you’ll never query it without storing it in a regular way.” It kinda struck me as really, really true (and unappreciated by me at the time).

Should have added: so we reformed the provenance – in line with the comments above.

@jbecla in the DES Science Portal we have implemented a pipeline-centric provenance model (Groth et al. 2009) as an extension of our workflow system. A nice application is the infrastructure we are developing to build science-ready catalogs. It involves a large number of pipelines and data products until you get to the final catalog. In this implementation we keep track of all pipelines executed, and since each pipeline has well-defined input data, configuration, and code version, we have full control over how the catalog was created. I don’t know much about OPM or other provenance models, but that one fits our needs well.

Regarding your question: for a given pipeline execution we have a process_id and several job_ids that are executed on each processing node or core. From the job_id I get the input data (the list of files or the database query used to retrieve the data) and the processing node/core where it was executed. The data can be partitioned in several ways depending on the application and performance requirements. For the kind of problem we are trying to solve it was natural to extend the workflow system to support provenance, and I cannot imagine how to separate these two things; perhaps I am too biased by our current implementation, and it would be nice to learn what others have done.
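The process_id/job_id bookkeeping described above can be sketched roughly like this. All field and identifier names here are illustrative, not the actual DES Science Portal schema:

```python
# Sketch of pipeline-centric provenance: a pipeline execution gets a
# process_id; each unit of work gets a job_id carrying its inputs and
# the node/core it ran on.

processes = {}  # process_id -> {"jobs": [job_id, ...]}
jobs = {}       # job_id -> {"inputs": [...], "node": ...}

def register_job(process_id, job_id, inputs, node):
    processes.setdefault(process_id, {"jobs": []})["jobs"].append(job_id)
    jobs[job_id] = {"inputs": inputs, "node": node}

register_job("p-1", "j-1", ["file-a.fits", "file-b.fits"], "node-3")
register_job("p-1", "j-2", ["file-c.fits"], "node-5")

# From a job_id we can recover the input data and where it was executed:
print(jobs["j-1"]["node"], jobs["j-1"]["inputs"])
# -> node-3 ['file-a.fits', 'file-b.fits']
```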

Hi Angelo,

Thanks! I have looked in detail at OPM, and tried to use the approaches from there that I thought were applicable to us. I’d be very much interested in your early feedback on the provenance architecture I came up with; please peek at my work in progress at: and let me know what you think.