Butler specs - question re. request to put in multiple repositories

Regarding @gpdf’s butler request:

We need to understand how put()/writing works when multiple repositories are made visible through a single Butler. For get()/reading, a single search order makes sense. For put(), it may be desirable to support alternative destinations (local disk, user workspace, Level 3 DB), or even multiple destinations for a single put().
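
To make the question concrete, here is a hypothetical sketch of the behavior being asked about. The multi-repository constructor arguments and the fan-out put() are assumptions drawn from this discussion, not current API:

```python
from lsst.daf.persistence import Butler

# Hypothetical: a Butler made aware of several repositories at once.
butler = Butler(inputs=['/data/repoA', '/data/repoB'],          # searched in order for get()
                outputs=['/scratch/local', '/user/workspace'])  # candidate destinations for put()

# get(): a single search order makes sense -- first match wins.
exposure = butler.get('calexp', {'visit': 123, 'ccd': 4})

# put(): should this go to one destination, or fan out to several?
butler.put(exposure, 'calexp', {'visit': 123, 'ccd': 4})
```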

Having one repository is good for provenance. @ktl proposed this solution for Gregory’s butler request:

Expand the configuration for a given dataset type so that it has multiple locations (template + storage + pythonType). The locations could be added to or overridden via policy settings (we will make this configurable at Butler-instantiation time; n.b. policy files can now live in the repository). Each location will be markable as read-only, write-only, or read+write.
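
A minimal sketch of what such an expanded configuration might look like, written as a Python dict for illustration. The `locations` list and `access` field are assumptions drawn from the proposal above, not the current policy schema, which has a single template/storage/pythonType per dataset type:

```python
# Hypothetical expanded per-dataset-type configuration (proposal sketch only).
calexp_policy = {
    'calexp': {
        'locations': [
            {'template': 'calexp/v%(visit)d_f%(filter)s.fits',
             'storage': 'FitsStorage',
             'pythonType': 'lsst.afw.image.ExposureF',
             'access': 'read+write'},
            {'template': 'calexp/v%(visit)d_f%(filter)s.h5',
             'storage': 'Hdf5Storage',  # hypothetical storage name
             'pythonType': 'lsst.afw.image.ExposureF',
             'access': 'write-only'},
        ],
    },
}
```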

This would allow multiple locations within one repository. When writing, Butler would write to all of the writable destinations; similarly, reading would read from all of the readable destinations. (The C++ daf_persistence code allowed serialized reading from multiple locations; we'll have to port that behavior from the C++ code.)

@gpdf, can you comment on this - is it ok?

Having talked this over with @ktl some more, we’ve evolved this idea somewhat. Here’s what I understand of the current design thinking for how multiple repositories can work. Please comment, criticize (constructively), and/or raise vociferous approval.

I’d specifically like input from @gpdf, @ktl, and @jbecla.

Repository Definition

  • A Dataset is a kind of Repository (“Dataset Repository”)

  • Dataset Repositories contain all the information (e.g. provenance, policy) needed to use the dataset by itself (i.e. without an Aggregator Repository, even if it was created by one).

  • Multiple Repositories can be combined into an “Aggregator Repository”.

    • (Presumably an Aggregator Repository’s contents could include both Dataset Repositories and Aggregator Repositories. I don’t know of any use cases where an Aggregator Repository would need to contain another Aggregator Repository, though.)
  • Aggregated Repositories (e.g. a Dataset Repository) must not depend on Aggregator Repositories in any way; the Dataset Repository must contain all the data and metadata (e.g. provenance, policy) that it needs to be usable standalone (see the sketch below).
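
Here is a minimal sketch of the two repository kinds described above; the class and attribute names are hypothetical, not part of any existing Butler API:

```python
class DatasetRepository:
    """Self-contained: carries everything needed to use it standalone."""
    def __init__(self, root, policy, provenance):
        self.root = root
        self.policy = policy          # per-repository policy, stored in the repo
        self.provenance = provenance  # complete; no reference to any aggregator

class AggregatorRepository:
    """Combines repositories; its members must not depend on it."""
    def __init__(self, members):
        # Members may be DatasetRepositories or (possibly) AggregatorRepositories.
        self.members = list(members)
```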

Input Repositories

  • Need to be labeled with an identifier that can be referred to by the dataId

  • Where the identifier is defined is TBD; it may live in repository metadata.

  • A specific Repository will be selectable by dataId, something like {'repositoryLabel': 'repositoryA'} (see the sketch after this list).

  • If a Repository is not specified, Butler will look in all available input Repositories, and the normal Butler behaviors will apply (e.g. currently, if multiple results match a given dataId, Butler will raise).

  • Versioning (DM-4168) and branching (DM-4520) will be supported.
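
A hypothetical usage sketch of the label. `repositoryLabel` is the proposed dataId key, not an existing Butler feature, and the constructor argument is an assumption:

```python
from lsst.daf.persistence import Butler

# Hypothetical multi-input construction, as in the earlier sketch.
butler = Butler(inputs=['/data/repoA', '/data/repoB'])

# Select a specific input Repository by its label:
exposure = butler.get('calexp', {'visit': 123, 'ccd': 4,
                                 'repositoryLabel': 'repositoryA'})

# No label: every available input Repository is searched; if more than one
# result matches, Butler raises (its current behavior for ambiguous dataIds).
exposure = butler.get('calexp', {'visit': 123, 'ccd': 4})
```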

Output Repositories

  • By default, all puts go to all output Repositories.

  • Configuration allows finer-grained control over which puts go to which Repositories (see the sketch after this list).

  • Need to be able to specify this configuration at different times, including:

    • configuration (policy)

    • command line
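
A hypothetical sketch of what the fine-grained control might look like; the routing table and the `putRouting` constructor argument are assumptions, not current API:

```python
from lsst.daf.persistence import Butler

# Hypothetical routing table: which dataset types go to which output
# Repositories. The same routing could equally be supplied via policy
# or a command-line option.
routing = {
    'calexp': ['repoFits', 'repoHdf5'],  # fan out to both outputs
    'src':    ['repoFits'],              # catalog goes to one output only
}
butler = Butler(outputs=['repoFits', 'repoHdf5'], putRouting=routing)
```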

So this could allow a FITS output repository and an HDF5 output repository to be used at the same time (could be useful for testing)?

Yes, provided that the client code supplies a translation step.
For example:

  1. Get a FITS file.
  2. Load it into an AFW Image (or other in-memory object).
  3. Put the AFW Image via Butler.put().
  4. A provided (non-Butler) AFW-to-HDF5 serializer is used to write an HDF5 file into the Repository.

Given the right object types and serializers, you probably wouldn’t even have to go through an AFW Image. But aside from being able to say which object type and serialization format to use, that is outside the scope of Butler.
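
A rough sketch of the four steps, assuming a Butler configured with an HDF5 output Repository as above; the AFW-to-HDF5 serializer and the Butler wiring that would invoke it are hypothetical:

```python
import lsst.afw.image as afwImage
from lsst.daf.persistence import Butler

butler = Butler(outputs=['/data/hdf5Repo'])  # hypothetical HDF5-backed output

# Steps 1-2: read the FITS file into an in-memory AFW object.
exposure = afwImage.ExposureF('input.fits')

# Step 3: put the in-memory object.
butler.put(exposure, 'calexp', {'visit': 123, 'ccd': 4})

# Step 4: the output Repository's configured storage would call the registered
# (non-Butler) AFW-to-HDF5 serializer to write the HDF5 file.
```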

Sorry I didn’t get this in before you posted, but “dataset” has had a different definition in the Butler world. I had been using it as “something that can be retrieved into a single Python object”, which could range from a single number up to a table of measurements. I think you are using it as “collection of files”, which is a bigger concept.

Oh, right. I actually knew that, but overlooked it.

What if we call the “repository that contains datasets” a Data Repository? Otherwise, conceptually I think it still holds together.

One of our big requests is to be able to persist new kinds of outputs from tasks without modifying obs_* packages. I had an idea for this: I suggest that we have templates for kinds of data. Any data that matches an existing template can be written by just specifying a name.

For example, we make a template for science exposures. These use a particular set of keys (which may vary from one obs_* package to another) and have a particular data type (lsst.afw.image.ExposureF). We can use the template for images output by the ISR task, the image characterization task, and/or the calibration task.
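
A hypothetical sketch of such a template; the schema and names are illustrative of the idea, not an existing CameraMapper API:

```python
# Hypothetical template for science exposures.
scienceExposure = {
    'keys': {'visit': int, 'ccd': int, 'filter': str},  # may vary per obs_* package
    'pythonType': 'lsst.afw.image.ExposureF',
    'storage': 'FitsStorage',
    'template': '%(name)s/v%(visit)d_f%(filter)s_c%(ccd)d.fits',
}

# Any task output matching the template could then be written by name alone, e.g.
# butler.put(exposure, 'postIsrExposure')  # resolved against scienceExposure
```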

We would need templates for science exposures and their associated source catalogs, coadds and their associated source catalogs and sky maps, and so on. I would also like a template for task configuration (one that can be used by all tasks) and task metadata.

I would hope that obs_* packages would define a fairly complete set of such templates, so we rarely need to add more. With any luck some of them, such as coadd data products, can be shared (defined in a base package, e.g. as part of CameraMapper). Others, such as science exposures, could also be defined there, but will need some obs_*-specific information about ID keys.

The current term “dataset type” might be a nice fit for these templates. They describe a type of data, but we can write many named instances.

A possible refinement is to allow us to define related collections of dataset types. For instance, science exposures and source catalogs always go together and use the same set of ID keys. Perhaps we could have a way of defining a group of related templates that all share the same set of ID keys (sketched below).
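
A hypothetical sketch of such a group; names and schema are illustrative only:

```python
# Hypothetical group of related templates sharing one set of ID keys.
exposureGroup = {
    'keys': {'visit': int, 'ccd': int, 'filter': str},
    'members': {
        'scienceExposure': {'pythonType': 'lsst.afw.image.ExposureF',
                            'storage': 'FitsStorage'},
        'sourceCatalog':   {'pythonType': 'lsst.afw.table.SourceCatalog',
                            'storage': 'FitsCatalogStorage'},
    },
}
```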

This is exactly the genre concept. Glad we’re thinking alike.