Image metadata: capture, storage, and export

Recently a number of questions have emerged with respect to image metadata, including metadata derived from observatory events and telemetry and metadata computed by DM pipelines. How this metadata is to be captured, stored, and eventually provided to productions and users has not been clearly documented in LDM-230. The relationship between the EFD and the productions is similarly unspecified.

Gregory and I discussed this a bit today and came up with the following proposal, which we think meets the requirements and clarifies what has been written without significantly conflicting with it.

Comments/criticisms are welcome as always; we will build the results into the developing design documents.

  • For L1, we store in the L1 Database exposure-centric metadata that includes the metadata captured for AP (which might include certain other values not actually used by AP but determined to be of interest) along with metadata computed in AP.

  • Prompt processing image ingest needs to subscribe via the SAL interface to those metadata topics that are to be captured, as early as possible (soon after the start of integration); retrieve a limited amount of history (e.g., the previous two values) via the SAL interface; gather all updates during the exposure until a defined window (in the tens of milliseconds; exact spec to be located) after readout is complete; and then feed those captured values to a predefined set of algorithms to determine what goes into the actual metadata. Examples of those algorithms might be “select last”, “compute average”, “interpolate using method X”, or even “just return the whole list”; a sketch appears after this list.

  • As much of the EFD as possible (minus redacted information) should be converted to exposure-centric metadata in the DR L2 Database (along with metadata computed in the DRP); the conversion happens before other DRP processing, making all of this metadata available to the rest of the DRP. This conversion needs to happen separately for each DRP because the algorithms used during the conversion may change.

  • Exposure-centric metadata from any of the above sources can be attached to raw or calibrated exposures on export. For each DRP, a default set will be defined (including at a minimum the metadata used for processing that DR) but users can subtract from (or possibly add to) it.
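To make the capture-then-reduce step concrete, here is a minimal Python sketch of the kinds of per-topic reduction algorithms named above. Everything here (the TelemetrySample type, the topic names, the REDUCERS table) is an illustrative assumption, not part of any defined SAL or DM interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class TelemetrySample:
    """One value captured from a SAL metadata topic (illustrative)."""
    timestamp: float  # e.g. TAI seconds
    value: float

def select_last(samples: List[TelemetrySample]) -> float:
    """'Select last': keep the most recent captured value."""
    return samples[-1].value

def compute_average(samples: List[TelemetrySample]) -> float:
    """'Compute average' over all captured values."""
    return sum(s.value for s in samples) / len(samples)

def interpolate_at(t: float) -> Callable[[List[TelemetrySample]], float]:
    """'Interpolate using method X': here, linear interpolation at time t."""
    def _interp(samples: List[TelemetrySample]) -> float:
        for earlier, later in zip(samples, samples[1:]):
            if earlier.timestamp <= t <= later.timestamp:
                span = later.timestamp - earlier.timestamp
                if span == 0.0:
                    return later.value
                frac = (t - earlier.timestamp) / span
                return earlier.value + frac * (later.value - earlier.value)
        return samples[-1].value  # outside the sampled range: fall back
    return _interp

def whole_list(samples: List[TelemetrySample]) -> List[Tuple[float, float]]:
    """'Just return the whole list' of (timestamp, value) pairs."""
    return [(s.timestamp, s.value) for s in samples]

# Hypothetical per-topic configuration: which algorithm produces the
# exposure-centric value stored for each captured topic.
REDUCERS: Dict[str, Callable] = {
    "dome_temperature": compute_average,
    "telescope_altitude": interpolate_at(1234.5),  # e.g. mid-exposure time
    "filter_position": select_last,
    "wind_speed": whole_list,
}

def reduce_metadata(captured: Dict[str, List[TelemetrySample]]) -> Dict[str, object]:
    """Reduce each topic's captured samples to its stored metadata value."""
    return {topic: REDUCERS.get(topic, select_last)(samples)
            for topic, samples in captured.items() if samples}
```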

Hmmm. I am confused. Please help. I understand that there are two paths here.

  1. Data bound for AP, or any kind of prompt processing. These may be crosstalk-corrected data. These crosstalk-corrected data have no retention requirement. Data handled here are logically temporally disconnected from their generation if we allow that AP may fall behind.

  2. Data collected to be put into the archive. These are not crosstalk-corrected. These are put into the archive and are the basis for DRP (and any other non-prompt processing uses). Data handled for archiving are logically temporally disconnected from their generation.

Am I hearing that the EFD maintained by the OCS is not a “prompt” source of truth, useful in the case that AP is keeping pace with data acquisition?

Again I am confused. “minus redacted information” – to me, redaction means “omit”. I understand the logical requirement is “only allow permitted access”. What is the meaning of “redact” here?

I am confused.

I would guess that the metadata needed during a process such as DRP would be known and identified ahead of release processing. Techniques such as pre-extracting this data into files, to allow for efficient and decoupled processing, would be allowed. Are you implying that there is a requirement for direct access to the reformatted EFD by DRP codes, and that this “prior extraction into files” mode cannot be the sole method by which the pipelines access data in the reformatted EFD, because there is some unknowable dependency on reformatted EFD data that cannot be declared as a data dependency beforehand?

I understand this to be a requirement for metadata extraction from something like a database every time a user requests a file. Is this a likely consequence of what you wrote?

“along with metadata computed in AP”

We have tentatively established requirements on the AP codes such that they must be run for purposes other than AP – i.e., to compute quality parameters that are essential to the scheduler. I find no consideration of the dependencies on data so casually invented and stored for potential downstream reliance. Hence I would like to hear the thinking behind this statement with respect to data coupling with downstream processing. What use case did you have in mind?

Yes, there are two paths.

Crosstalk-corrected (at least for now) data bound for AP needs to have metadata captured from the OCS attached to it, because we don’t want the science pipelines to query any version of the EFD in real time. The metadata-attacher (which I expect is the Forwarder component of the ingest system) can obtain that metadata in whatever way makes the most sense, but it seems quite likely that using the SAL interface will be preferable to directly querying the EFD in terms of reliability, latency minimization, and dependency avoidance.
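As a rough illustration of the attachment step, the reduced values could be written into the image’s primary FITS header before hand-off to AP. This is only a sketch: the Forwarder’s actual interfaces are not defined here, and astropy is used purely for illustration.

```python
from astropy.io import fits

def attach_metadata(hdul: fits.HDUList, metadata: dict) -> None:
    """Write reduced OCS metadata into the primary header (illustrative).

    Keys are truncated to FITS's 8-character limit for simplicity; a real
    implementation would use a defined keyword mapping instead.
    """
    header = hdul[0].header
    for key, value in metadata.items():
        header[key.upper()[:8]] = value

# Usage sketch, with values as produced by a reduction step:
# hdul = fits.open("raw_exposure.fits", mode="update")
# attach_metadata(hdul, {"airmass": 1.23, "domtemp": 5.4})
# hdul.flush()
```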

Raw data to be put into the archive does not need to have metadata attached to it because that metadata will come from other places (either the AP output or DRP preparations).

Redacted information is present in the original source of truth but is not presented to end users. In this case, redacted information would not be transformed into the exposure-centric metadata tables in the DRP L2 Database. It is still present in the EFD in the Science Data Archive but with access restrictions.
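In other words, redacted topics are simply skipped during the conversion rather than deleted. A minimal sketch, with a hypothetical redaction list:

```python
# Hypothetical, policy-driven list of topics that must not be converted.
REDACTED_TOPICS = {"operator_chat", "staff_location"}

def topics_to_convert(all_efd_topics: set) -> set:
    """Topics eligible for conversion into exposure-centric metadata.

    Redacted topics are never transformed into the DRP L2 Database, but
    they are not deleted: they remain in the archived EFD, where access
    restrictions apply.
    """
    return all_efd_topics - REDACTED_TOPICS
```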

The conversion from the EFD to exposure-centric metadata for use in the DRP is exactly the pre-extraction you mention. We are trying to avoid any need for either the AP or DRP codes to access the EFD directly.

Yes, this has long been the plan and an anticipated requirement. Of course, there can be optimizations where the “usual” metadata to be attached to the pixels can be pre-serialized and persisted for more rapid retrieval, but the goal was to avoid having such persisted metadata be the source of truth.
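One way to read “optimizations … without becoming the source of truth” is a cache keyed on database state: the pre-serialized copy is served only while it matches what the database currently says, and is regenerated otherwise. Everything below (the db object, metadata_version, fetch_metadata) is an assumed interface for illustration only.

```python
import json
from pathlib import Path

def get_exposure_metadata(exposure_id: int, db, cache_dir: Path) -> dict:
    """Return exposure metadata, preferring a pre-serialized copy.

    The database remains the source of truth: the cached file carries the
    database's version stamp for this exposure, and a stale or missing
    cache entry is regenerated from the database.
    """
    cache_file = cache_dir / f"{exposure_id}.json"
    current_version = db.metadata_version(exposure_id)  # hypothetical API
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if cached.get("_version") == current_version:
            return cached["metadata"]
    metadata = db.fetch_metadata(exposure_id)  # hypothetical API
    cache_file.write_text(json.dumps(
        {"_version": current_version, "metadata": metadata}))
    return metadata
```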

The metadata computed by AP per exposure and visit includes those quality parameters and other data products that characterize a calibrated exposure. This metadata, which is quite distinct from the captured-from-OCS metadata and perhaps could be given a different name, is stored in the L1 Database of the Science Data Archive. See footnote 30 in section 4.3 (Level 1 Catalogs) of the DPDD and the baseline database schema. These are just data products, like DIASources. A specified selection of these will be published as telemetry and hence also be recorded in the EFD, but for most uses scientists would use the L1 Database and not the EFD as the source for these.

I might have missed this, but if we expect to receive information from SAL metadata topics during Prompt processing, where do we get this information during catch-up processing?

This has to come from the EFD, if the metadata is not already attached to the raw image by the image archiving service (which was not contemplated above in order to keep archiving simple). I think it is likely that the data should come from direct EFD query (rather than, say, the SAL interface) since there are no latency concerns for catch-up, but that is a design issue that can be discussed.
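In that case, catch-up could recover the same samples the live SAL subscription would have captured by querying the EFD over the exposure’s time range and running the same reduction algorithms over the result. The query interface below (efd.select_range) is hypothetical.

```python
from typing import Dict, List

def catchup_capture(efd, topics: List[str],
                    t_start: float, t_end: float) -> Dict[str, list]:
    """Recover from the archived EFD the samples that the live SAL
    subscription would have captured, from integration start until the
    defined window after readout. `efd.select_range` is a hypothetical
    time-range query API.
    """
    post_readout_window = 0.050  # seconds; "tens of milliseconds", exact spec TBD
    return {topic: efd.select_range(topic, t_start, t_end + post_readout_window)
            for topic in topics}
```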