DRP Overview Block Diagram

As many of you know, I’ve been working on a block diagram describing the DRP processing flow under the auspices of the Science Pipelines Working Group:

https://confluence.lsstcorp.org/display/~jbosch/Data+Release+Production+Top-Level+Overview

It’s not done - I’ve only added detailed descriptions for a small fraction of the diagram, and I find that the diagram needs to change slightly every time I add another description. But I’ve heard some people (especially @robkooper) may be interested in looking at even a work-in-progress version, and it should be at least near completion by the end of this week. Of course, at that point, it will still be subject to further review by the rest of the SciWG.

It’s also worth noting that I plan to extract a document describing parallelization use cases and requirements once the diagram and its descriptions are done, and I think that will be much more useful for the team designing our orchestration middleware. I’m hoping to be able to deliver that some time late next week.

Jim, could you make sure it’s linked from the scipi-wg confluence page?

The latest draft of the diagram is now complete. Comments are quite welcome. However,

  • if you’re commenting on the clarity or self-consistency, please comment on DM-5674;
  • if you’re commenting on the feasibility, desirability, or other scientific/algorithmic content, please comment on the bottom of the Confluence page itself.

Hi @jbosch,

I would like to comment on this work. The associated DM-5674 is marked as Done so I’m commenting here since I’m not sure it is worth reopening that issue just for my comments. I hope you don’t mind.

First of all, let me say that I find this pictorial description very valuable. I’m reading this as someone who needs to understand how the processing will be done in terms of the inputs and outputs of each step in the pipeline, the precedence relationships between steps, and each step’s potential for parallelisation. Some of my comments / suggestions may be obvious to someone who understands the scientific purpose behind each pipeline: sorry if that’s the case, but they only reflect my ignorance. I don’t know if your intention is for this diagram to be useful beyond the Science Pipelines working group: I’m assuming this is the case, and this is why I’m providing feedback.

Here are my comments:

  • I’m assuming this diagram is a companion to the DPDD. If so, I expect to find the same vocabulary, or at least references to the same concepts. For instance, in the DPDD (Latest Revision 10/7/2013, page 34, Figure 1) I find pipelines named Single Image Processing or Relative Calibration, which I don’t find in your diagram. I’m guessing these correspond to the bootstrap_imchar and bootstrap_jointcal pipelines respectively, but I don’t know for sure. Another example: in your diagram you use the term magzero, which I’m guessing is the same concept as the DPDD’s zero point, but I’m not sure. It would be nice for both the DPDD and your block diagram to be consistent.
  • I think that a comment for each pipeline giving the name of the stack’s task which implements the pipeline would also be very useful. For instance, something like: the processCCD.py task implements the bootstrap_imchar pipeline.
  • I would suggest adopting a single structure for the textual description of each pipeline, for instance using the following sections: Purpose, Inputs, Outputs, Implementation, Processing. The Implementation section would include most of the description text currently in the document, that is, how the pipeline is performed in terms of science (e.g. algorithms, thresholds, and so on). The Processing section would include comments about the execution of the pipeline, such as parallelization opportunities, storage areas used for each step, the name of the stack’s task (see previous item), etc. (See the sketch after this list for one possible shape of such a structure.)
  • I would suggest the diagram make a visually explicit distinction between data stored in the form of a file (e.g. an image) and data stored in the (Qserv) catalog. This distinction is important for me because the storage platforms to be deployed for those kinds of data products (images and catalogs) are different. It would also be useful to visualise which pipelines need to interact with the catalog, whether to retrieve inputs or to update the catalog with the pipeline’s output.
  • It would also be useful to clarify, for each pipeline, which inputs are modified by the pipeline (if any). Ideally, at the level of a single pipeline, all inputs would be read-only and all outputs would be write-only, but I don’t know if this way of working is actually the project’s intention. This information is also very valuable for people like me who are responsible for designing the storage infrastructure for the execution of each pipeline. For instance, the bootstrap_jointcal pipeline takes as input calexp[0.ic] and produces calexp[0]. Does this mean that this pipeline modifies the input calexp, or that it instead produces a new version 0 of the calexp taken as input?
  • I would suggest adding to the diagram the meaning of N in the statement FOR n IN [1, N] in the Direct Image Processing block.
  • I would suggest adding to the diagram the meaning of M in the statement FOR n IN [1, M] in the Difference Image Processing block.
  • Is there any way for me to know from the diagram or from the associated explanation of each pipeline, what data products need to be permanently stored as opposed to those which are stored in the scratch disk area during the ongoing data release processing? I’m a bit confused by the Intermediate Data Product (purple disk) and the intermediate component attribute of the Output Data Product (red disk).
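
To make the suggestion about a uniform per-pipeline structure more concrete, here is a purely illustrative sketch of what such a structured description could look like. The section names follow the proposal above; the field contents for bootstrap_imchar are placeholders I’ve made up, not content taken from the diagram or the DPDD.

    # Hypothetical sketch of a uniform per-pipeline description structure.
    # Field values below are illustrative placeholders only.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PipelineDescription:
        name: str            # pipeline name as it appears in the diagram
        purpose: str         # what the pipeline is for, in a sentence or two
        inputs: List[str]    # data products read (ideally read-only)
        outputs: List[str]   # data products written (ideally write-only)
        implementation: str  # the science: algorithms, thresholds, etc.
        processing: str      # execution notes: parallelization, storage, task name

    example = PipelineDescription(
        name="bootstrap_imchar",
        purpose="Initial per-CCD image characterization (placeholder text).",
        inputs=["raw exposure", "reference catalog"],
        outputs=["calexp[0.ic]"],
        implementation="Algorithmic description would go here.",
        processing="Parallelizable per CCD; possibly implemented by processCCD.py today.",
    )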

Again, thanks a lot for this great work.

At this stage, my hope is that it will be useful outside the working group, with the caveat that we know we haven’t yet put sufficient effort into making it so.

Most of your questions reflect concepts that are still TBD, but I’ll reply to each one in detail.

This is a known issue that the SPDWG is working on, and I expect it to be resolved by the end of the cycle. That said, zeropoint vs. magzero is a particularly easy one to fix that I hadn’t noticed, and I’ll just go ahead and update the diagram now.

The problem is that we don’t really have a 1-to-1 or even many-to-many relationship between the current codebase and the planned future processing, so I’m worried that trying to define such links would produce more confusion than clarity. We’ve talked about trying to put together a similar diagram that describes the current state of the pipeline - which we could map to the current codebase - but that’s currently lower priority than some other working group tasks I have.

My hope is that after the SPDWG work is done this cycle, we will relatively quickly refactor our current pipeline to match the proposed future pipeline, with no-op or placeholder components where we haven’t implemented the algorithms we need. That will be a relatively big change that doesn’t bring any obvious near-term benefits (and may have some obvious drawbacks) for current users of the pipeline, though, so it may be optimistic to think that this can happen quickly.

This is a good idea, and something that I avoided in this draft simply because trying to impose even that minimal structure has gotten in the way of me churning out the content itself in the past. I hope to improve the organization in the future, and this is a good starting proposal for how.

From the perspective of the Science Pipelines, these distinctions are actually a middleware implementation detail. Deciding what goes directly into the final public database, what goes into a (smaller? non-qserv?) data release production support database, and what goes into standard files on disk is ultimately up to the Data Access team (though people like me of course need to provide them enough information to make those decisions).

The “versions” in the diagram (inside the square brackets) are supposed to indicate when a data product is conceptually updated, in the sense that previous versions are superseded from a quality standpoint and will not be used by any later pipeline stage. This is complicated by the fact that some of my data products have components, and some stages will update one component without updating others.

Here are some principles that may help to explain what’s going on:

  • I’ve incremented the version of a data product whenever any of its components is updated.
  • A particular version of a data product implicitly includes the latest versions of all components.

I think we clearly need to have the ability to store non-final versions of data products for diagnostic purposes, but I also think we need to have a system that allows us to simply retrieve the latest version of a compound data product, since that’s what nearly all stages will want as input. When updates are written, I consider it a middleware implementation detail whether they’ll actually overwrite files on disk or write new files while (possibly) deleting old ones.
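
To make that versioning convention concrete, here is a minimal sketch (purely illustrative - not the actual middleware or butler API, and all names are made up) of how a compound data product could track component updates by bumping a single version, with “latest version of the product” resolving each component to its most recent write:

    # Illustrative sketch of the versioning rules described above; hypothetical
    # names, not real middleware.  Writing any component bumps the product's
    # version, and a request at a given version resolves each component to its
    # most recent write at or before that version.
    from collections import defaultdict

    class CompoundDataProduct:
        def __init__(self, name):
            self.name = name
            self.version = 0
            # component name -> list of (version, payload), in write order
            self._writes = defaultdict(list)

        def write(self, component, payload):
            """Updating any component increments the product's version."""
            self.version += 1
            self._writes[component].append((self.version, payload))
            return self.version

        def get(self, component, version=None):
            """Return the latest write of `component` at or before `version`
            (default: the latest version of the whole product)."""
            if version is None:
                version = self.version
            candidates = [p for v, p in self._writes[component] if v <= version]
            if not candidates:
                raise KeyError(f"no {component} at or before version {version}")
            return candidates[-1]

    # Example: a calexp with an 'image' component and a 'wcs' component.
    calexp = CompoundDataProduct("calexp")
    calexp.write("image", "initial image")       # -> version 1
    calexp.write("wcs", "bootstrap astrometry")  # -> version 2
    calexp.write("wcs", "jointcal astrometry")   # -> version 3
    assert calexp.get("image") == "initial image"  # latest version includes latest of all components
    assert calexp.get("wcs", version=2) == "bootstrap astrometry"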

This diagram would probably be much clearer if I didn’t have any multi-component data products. That’s not the way most of our pipeline developers think of these data products, so I’m not sure I want to do that, but it’s worth considering, and it’s the reason this is so confusing. Anyhow, here are the principles:

  • If a pipeline stage writes the final version of any component in a data product, I’ve colored the data product red, and indicated the component with italics.
  • A non-italic component in a red data product is a non-final update to that component.
  • A data product for which no updates are final is colored dark purple.
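
As a hypothetical way of spelling out those coloring rules (the names and structure here are mine, not anything in the diagram tooling), the decision could be written as:

    # Hypothetical encoding of the color/italics rules above, just to make the
    # decision explicit.  `updated` is every component the stage writes;
    # `finalized` is the subset for which that write is the final version.
    def diagram_style(updated, finalized):
        """Return (node color, italicized components) for a data product."""
        color = "red" if finalized else "dark purple"
        # Final components are italicized; non-italic components in a red node
        # are therefore non-final updates.
        return color, set(finalized)

    # Example with placeholder component names: a stage that finalizes 'wcs'
    # but only provisionally updates 'image'.
    color, italics = diagram_style({"image", "wcs"}, {"wcs"})
    assert color == "red" and italics == {"wcs"}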

I’ll try to include most of these new “how to read this diagram” points in the diagram page itself sometime next week.

DPDD is actively being edited and updated as part of the science pipelines working group process.

This diagram will not go into the DPDD but should instead be in the new Science Pipelines Design document or documents that will replace the current LDM-151.

Our goal is to have the science pipelines read from “normal” databases only, where databases are required at all. While final data products will be incrementally loaded into Qserv as they are produced, Qserv should never need to be queried by the pipelines – only QA (and, of course, eventually science users).

I would very much like to avoid overwrite-in-place of prior pipeline outputs since it adds complexity to provenance, recovery, etc. I have come to understand that scientists don’t necessarily think that way, but I do think that we can handle this in middleware.
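
To illustrate what “no overwrite-in-place” could look like in practice, here is a minimal sketch under assumed conventions (the path layout and function names are made up, not the actual butler or middleware design): each logical update goes to a fresh versioned location, old outputs are never clobbered, and removing superseded versions becomes a separate, explicit cleanup step.

    # Minimal sketch, assuming a made-up versioned directory layout; not the
    # real middleware.  New versions are written to new locations so prior
    # outputs are preserved for provenance and recovery.
    import os

    def versioned_path(root, dataset, version):
        return os.path.join(root, dataset, f"v{version:03d}")

    def write_new_version(root, dataset, current_version, writer):
        """Write the next version into a fresh directory; never touch old ones."""
        new_version = current_version + 1
        path = versioned_path(root, dataset, new_version)
        os.makedirs(path, exist_ok=False)  # fail loudly rather than clobber
        writer(path)                       # caller writes its files under `path`
        return new_version

Superseded versions could then be garbage-collected by a later cleanup pass rather than destroyed at write time.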

For those of you following these diagrams (I know at least @hsinfang and @srp are), please note that I’ve just made a relatively major change to both my data flow diagram and my parallelization diagram: I’ve removed MOPS as a sequence point in the middle of the processing. This follows a Science Pipelines Definition Working Group discussion in which we decided that MOPS was unlikely to be helpful as an input to coaddition, and hence we can move it to the end of the DRP, after all of the image processing is complete. I’ve moved it there in the data flow diagram, and removed it entirely from the parallelization diagram (since I haven’t yet attempted to show any of the other catalog-only afterburner processing there).