Change in Task metadata

With the implementation of RFC-783 the way that metadata attached to a Task is stored has changed. Previously, metadata was stored in a PropertyList that was combined into a PropertySet. There is now a specialized class for dealing with task metadata called TaskMetadata. This class provides some of the same methods as PropertySet and allows "." separators to refer to a hierarchy. It does not support the full PropertySet API, and all code that relied on PropertySet methods for metadata has been updated. Some compatibility methods do exist, but they issue deprecation warnings.
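As a rough illustration of the dotted-key idea, here is a toy sketch (this is not the real lsst.pipe.base.TaskMetadata API, just a minimal dict-backed emulation of how "." separators can map onto nested metadata scopes):

```python
class ToyMetadata:
    """Toy dict-backed metadata container with dotted-key access.

    Illustrative only; the real TaskMetadata class has a richer,
    typed interface.
    """

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        # Split "a.b.c" into nested scopes, creating them as needed.
        *scopes, leaf = key.split(".")
        node = self._data
        for scope in scopes:
            node = node.setdefault(scope, {})
        node[leaf] = value

    def __getitem__(self, key):
        # Walk the hierarchy one "." component at a time.
        node = self._data
        for part in key.split("."):
            node = node[part]
        return node


meta = ToyMetadata()
meta["isr.runStartUtc"] = "2022-02-01T00:00:00"
print(meta["isr.runStartUtc"])  # prints 2022-02-01T00:00:00
```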

There have been some changes to Butler to support the new TaskMetadata.

  1. butler.put() can now convert a TaskMetadata to a PropertySet if the dataset type for the task metadata is defined as a PropertySet in the given repository.
  2. If a *_metadata dataset type is used by a pipeline and the definition already exists in a Butler repository then that definition will be used when storing the metadata even though the Python code is using TaskMetadata. This means that if an existing repo defines the task metadata to be a PropertySet then that will be used when writing TaskMetadata. When those datasets are retrieved they will be returned as the expected PropertySet type.
  3. If a metadata dataset type does not exist in the repository then a new one will be created using TaskMetadata.
  4. TaskMetadata is serialized as a JSON file and not a YAML file.
  5. If the repository has had the dataset type for metadata modified to now indicate a TaskMetadata storage class, then a TaskMetadata will be returned by butler.get() even though it is stored on disk in YAML as a PropertySet.
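The put-side decision in points 1–3 can be sketched roughly as follows (the function and argument names here are hypothetical, not Butler internals; they just restate the rule that an existing repository definition wins, and new dataset types default to TaskMetadata):

```python
def storage_class_for_put(registered_types, dataset_type_name):
    """Pick the storage class to serialize task metadata with.

    ``registered_types`` stands in for the repository's registry of
    dataset type name -> storage class name. Hypothetical helper for
    illustration only.
    """
    if dataset_type_name in registered_types:
        # An existing definition (e.g. "PropertySet" in an old repo)
        # is honoured; the TaskMetadata is converted on put.
        return registered_types[dataset_type_name]
    # New dataset types are registered with the TaskMetadata storage
    # class and serialized as JSON rather than YAML.
    return "TaskMetadata"


# Old repo: the pre-existing PropertySet definition is reused.
print(storage_class_for_put({"isr_metadata": "PropertySet"}, "isr_metadata"))
# Fresh repo: a new TaskMetadata dataset type is created.
print(storage_class_for_put({}, "isr_metadata"))
```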

With these changes there should be no need to modify existing repositories, although, depending on configuration and history, you may sometimes get back a slightly different Python type from the one you initially stored.

We still have to decide whether to migrate existing repositories to the TaskMetadata storage class, and there is currently no butler admin script to simplify that change. For now this means that existing pipeline runs will generally still serialize metadata as YAML PropertySet files, even though all processing uses TaskMetadata internally.

Note that v23 (and DP0.2) does not support TaskMetadata and will likely never support it natively.


If we’re using metadata as an input connection, do we need to update it to TaskMetadata? I’m getting a conflicting definition error running such a pipeline on a fresh repository.

You mean you have a pipeline that uses metadata as an input? It seems we don't have any tests that do that (I ran ci_imsim and ci_hsc_gen3). The problem is that sometimes the repository will have the metadata defined as a PropertySet and sometimes as a TaskMetadata, and because pipelines are required to fully define their dataset types, this is a bit of a problem.

I think we may have to teach pipe_base that storage class conversion is now supported by the butler. Have you got an example showing the problem that I can take a look at and experiment with?

Yes, I do. The following MetadataTest.yaml:

description: Demonstration pipeline that takes its own metadata as input.
instrument: lsst.obs.subaru.HyperSuprimeCam
tasks:
  isr: lsst.ip.isr.IsrTask
  timing_isr:
    class: lsst.verify.tasks.commonMetrics.TimingMetricTask
    config:
      connections.package: ip_isr  # metrics package
      connections.metric: IsrTime  # metric name
      connections.labelName: isr   # partial name of metadata dataset
      metadataDimensions: [instrument, exposure, detector]  # TimingMetricTask assumes visit
      target: isr.run              # method name in metadata. Usually matches label for top-level tasks

can be run on a fresh repository generated using:

git clone https://github.com/lsst/ap_verify_ci_cosmos_pdr2/
setup -kr ap_verify_ci_cosmos_pdr2
ingest_dataset.py --dataset ap_verify_ci_cosmos_pdr2 --output foo/

pipetask run --pipeline MetadataTest.yaml --butler-config foo/repo/ --input "HSC/raw/all,HSC/defaults" --output "output" --data-query "instrument = 'HSC' and exposure = 59150 and detector = 50" --register-dataset-types

It should produce a conflicting definition error at graph generation.


Just to double check this: does this mean that we cannot use a pipeline version newer than the previous weekly to work with /repo/dc2 on lsst-devl?

New software can talk to an old repo. If you run a pipeline with modern software that creates metadata with the TaskMetadata storage class, then v23 clients will not be able to use that dataset type or read those datasets. There is no way to work around that, since v23 has no idea that lsst.pipe.base.TaskMetadata can possibly exist. That shouldn't be a problem, because it will only affect newly-created dataset types and so won't clash with any of the dataset types used for DP0.2.

If that dc2 repo is going to have some of the DP0.2 tests run on it, then they will be done with v23, so that won't be a problem.

I don’t know what the DP0.2 strategy is for the DP community in the summer. Will they be using v24, or will they be using v23 because that was what was used to create DP0.2?

Note: my question above was because of an error I received due to the daf_base branch of DM-33155 not having been merged. @timj just fixed that and I can now successfully run my new class and the latest main on /repo/dc2.