"Quality" information - what, where, how

The issue of how to handle “quality” information has come to the fore recently. Because it hasn’t been well described in our requirements or design documents to date, there is great potential for misunderstanding. Here, based on a HipChat conversation this morning, I (with input from others) propose definitions for the types of information that might fall under this umbrella, along with where each should be discussed and defined.

Image Information

This data is primarily science-oriented. It is computed mostly by the pipelines themselves; some may be computed by SQuaRE metric-computation insertions into productions. It generally goes into the ScienceCcdExposure table in the Science Data Archive (from which it may go into a CI dashboard and later a production dashboard).
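As a purely illustrative sketch of where such values would land, here is a minimal example of writing per-CCD image quality metrics into an archive-style table. The column names and metrics are hypothetical stand-ins for whatever LDM-153 ultimately defines:

```python
import sqlite3

conn = sqlite3.connect("science_data_archive.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS ScienceCcdExposure (
           visit INTEGER, ccd INTEGER,
           seeing_fwhm_arcsec REAL,  -- hypothetical science metric
           sky_background_adu REAL,  -- hypothetical science metric
           PRIMARY KEY (visit, ccd)
       )"""
)

def record_ccd_quality(visit, ccd, seeing, sky):
    """Insert one CCD's pipeline-computed quality metrics."""
    conn.execute(
        "INSERT OR REPLACE INTO ScienceCcdExposure VALUES (?, ?, ?, ?)",
        (visit, ccd, seeing, sky),
    )
    conn.commit()

record_ccd_quality(visit=1234, ccd=42, seeing=0.71, sky=180.5)
```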

Definition and generation of this data should be discussed among the Science Advisory Council, Project Science Team, Science Pipelines, and SQuaRE; the DB team will implement its storage in the archive and thus needs to be informed of the number, size, and complexity of the items.

Some of these data items may be important enough to belong in the Data Products Definition Document (LSE-163), but the design/definition of the others should go into a QA document forked from the Science Pipelines Design Document (LDM-151). All such data items should be incorporated into the Database Schema document (LDM-153) or its successors.

Source or Object Quantities

There will likely be quantities measured on Sources or Objects that are useful for algorithm validation and quality analysis but aren’t useful for final science analysis. That could be because they’re per-epoch measurements that are superseded by coadd or multifit measurements for science, or because they’re too algorithm-dependent or as-yet unproven to be generally useful. For instance, we may run some sort of shear estimation code in single-frame processing for the purpose of computing various ellipticity correlation null tests, but no one will actually use these measurements for weak lensing.
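To make the null-test idea concrete, here is a minimal, stack-free sketch of the simplest such test: checking that the mean of the per-epoch ellipticity components is consistent with zero. The arrays are synthetic stand-ins for whatever the shear code would actually measure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for per-epoch Source ellipticity measurements (e1, e2).
e1 = rng.normal(0.0, 0.3, size=10000)
e2 = rng.normal(0.0, 0.3, size=10000)

for name, e in (("e1", e1), ("e2", e2)):
    mean = e.mean()
    err = e.std(ddof=1) / np.sqrt(e.size)
    # For a well-behaved PSF correction, expect ~0 within the errors.
    print(f"<{name}> = {mean:.4f} +/- {err:.4f}")
```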

Definition and generation of this data should be discussed among the Science Pipelines and SQuaRE teams; the DB team and SQuaRE may be involved in implementing its storage in the archive or in non-archive internal dashboard systems.

The design/definition of these data items should go into LDM-151 or a related QA document and eventually into the database schema document (LDM-153) if they go into the Science Data Archive. Note that even if we decide they don’t make it into the Science Data Archive, we’ll still need to account for the space and tables somewhere (e.g. in a QA dashboard design section of the QA document).

Processing Metadata

This data is primarily oriented toward determining whether the production system or an integration test is working properly, has suffered regressions, or is delivering expected performance improvements. It is computed mostly by the middleware framework or by SQuaRE insertions, and it will generally go into an internal database for a CI dashboard and later a production dashboard.
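As an illustration of the kind of insertion meant here (all names and structure invented for the sketch), middleware or SQuaRE could wrap pipeline steps with something like the following and ship the resulting records to the dashboard database:

```python
import time

def timed(metrics, name):
    """Decorator recording a step's wall-clock time into `metrics`."""
    def wrap(func):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            metrics[name] = {"wall_seconds": time.perf_counter() - start}
            return result
        return inner
    return wrap

metrics = {}

@timed(metrics, "isr")
def run_isr():
    time.sleep(0.1)  # stand-in for real pipeline work

run_isr()
print(metrics)  # records like this would feed a CI/production dashboard
```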

The content and generation of these data items should be discussed among Science Pipelines and SQuaRE and Process Control. Storage should be implemented by SQuaRE or perhaps Process Control.

We have no current document describing developer systems and no detail on the production dashboard design, so this might go into one or more new documents (one of which might be a QA design document).


By the way, there is useful diagnostic/QA information that is currently computed by processCcd but is not captured anywhere (it is just output to the screen). An example is the RMS of the photometric zeropoint computed by the photocal code. It would be nice to stuff all of this information into the “metadata”, where it would be easy to fish out. Some information is currently put there (like the PSF chi-squared), but not very much.
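To illustrate the pattern, here is a stack-free sketch; the `Task.metadata` dict stands in for the stack's per-task metadata object, and the key name is made up:

```python
import statistics

class Task:
    """Stand-in for a pipeline task; `metadata` mimics per-task metadata."""
    def __init__(self):
        self.metadata = {}

class PhotoCalTask(Task):
    def run(self, zeropoints):
        rms = statistics.pstdev(zeropoints)       # stand-in for the real RMS
        print(f"zeropoint RMS = {rms:.4f} mag")   # what happens today
        self.metadata["PHOTOCAL_ZP_RMS"] = rms    # the proposed capture
        return rms

task = PhotoCalTask()
task.run([25.01, 25.03, 24.98, 25.02])
print(task.metadata)  # now easy to fish out downstream
```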

It should be logged, not just printed to the screen. The log can be harvested for QA information, but I don’t know how easy that currently is, especially as we are not yet (I think) attaching dataId information to the logs.
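For a sense of what log harvesting would involve, here is a toy parser; the line format, including the dataId prefix, is invented for illustration:

```python
import re

LOG = """\
processCcd: {visit: 1234, ccd: 42} photocal zeropoint RMS: 0.021 mag
processCcd: {visit: 1234, ccd: 43} photocal zeropoint RMS: 0.034 mag
"""

pattern = re.compile(
    r"\{visit: (?P<visit>\d+), ccd: (?P<ccd>\d+)\}"
    r" photocal zeropoint RMS: (?P<rms>[\d.]+)"
)

for match in pattern.finditer(LOG):
    print(match.group("visit"), match.group("ccd"), float(match.group("rms")))
```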

Does the new feature from DM-4342 of attaching dataId information to logs do what you are thinking of?

I think so – thank you. It’d be good to write a metadata harvester and see where we stand.

Yes, that would be possible, but it would be much easier to pull information from the “metadata” files. No need to parse a log file.
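A harvester along those lines could be quite short. Assuming, purely for illustration, that each dataId’s metadata were available as a JSON file (the stack’s actual persistence format may differ):

```python
import glob
import json

def harvest(pattern, key):
    """Collect one named metadata value across all matching files."""
    rows = []
    for path in glob.glob(pattern):
        with open(path) as f:
            md = json.load(f)
        if key in md:
            rows.append((path, md[key]))
    return rows

# e.g. harvest("output/metadata/visit*/ccd*.json", "PHOTOCAL_ZP_RMS")
```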