Setting in-repository policy via Butler

I’m working on creating an API that allows scripts to add Policy to butler output repositories (DM-7777).

The ‘new’ butler API allows a repository-arguments structure to be passed to butler when loading input repositories and when creating output repositories. I’m thinking of adding a parameter to that structure that accepts a dict (or Butler.Policy instance, which extends dict) for that purpose. The new policy would be written into the repository, used while the repository serves as an output, and loaded again whenever the repository is later used as an input.

Setting the in-repository policy should only be allowed on output repositories (input repositories should not be modified, generally speaking).

The only API change would be adding a keyword to lsst.daf.persistence.RepositoryArgs:

class RepositoryArgs(object):
    def __init__(..., policy=None):
        ...

so that then a policy can be created in a script e.g.:

newPolicy = {
    'datasets': {
        'testDataset': {
            'python': 'lsst.daf.persistence.test.TestObject',
            'template': 'basic/id%(id)s.pickle',
            'storage': 'PickleStorage'
        }
    }
}

and passed to butler:

import lsst.daf.persistence

repoArgs = lsst.daf.persistence.RepositoryArgs(root='path/to/output/repo', policy=newPolicy)
butler = lsst.daf.persistence.Butler(inputs='path/to/inputs', outputs=repoArgs)
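
As a rough illustration of what the 'template' entry in the example policy means: it looks like a Python %-style format string, so the path for a dataset would presumably be produced by substituting the keys of a data ID into it. The sketch below shows only that substitution in plain Python. The expand_template helper and the data ID value are hypothetical, and the real butler may resolve paths differently.

```python
def expand_template(template, data_id):
    """Fill a %-style path template with the keys of a data ID dict."""
    return template % data_id


# Same structure as the example policy above.
new_policy = {
    'datasets': {
        'testDataset': {
            'python': 'lsst.daf.persistence.test.TestObject',
            'template': 'basic/id%(id)s.pickle',
            'storage': 'PickleStorage',
        }
    }
}

template = new_policy['datasets']['testDataset']['template']
path = expand_template(template, {'id': 3})  # hypothetical data ID
print(path)
```

With this template, a data ID of {'id': 3} maps to a path relative to the repository root.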

People who are interested in adding policy to repositories: does this work for you? Do you have other ideas, or requirements that I’ve overlooked?

I’d specifically like +/- feedback from @ktl on this.

I’m a bit confused. As a naïve user, does this change the way that I create a butler? I currently only need to know the root to read or write it.

It will not change the way you create a butler (unless you want to add Policy details to the repository, which it sounds like you do not).

So if someone else has configured the Policy details I’ll still get them? Sorry to be dense, but I don’t understand all the changes you are making.

You’re not being dense. Thank you for considering it.

The short answer is yes.

Slightly longer: when ‘someone else’ creates the repository (as an output repository) and adds new policy details during butler init, that policy will get written to the repo. Later, when you use that repo as an input repository, butler will load the policy it finds there.
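
To make that round trip concrete, here is a toy model of the idea, not the butler implementation. It assumes the policy is stored as a JSON file inside the repository directory; the file name and both helper functions are hypothetical and exist only for this sketch.

```python
import json
import os
import tempfile

POLICY_FILE = 'repoPolicy.json'  # hypothetical name, not butler's real layout


def init_output_repo(root, policy=None):
    """Create an output repo directory and, if given, store the policy in it."""
    os.makedirs(root, exist_ok=True)
    if policy is not None:
        with open(os.path.join(root, POLICY_FILE), 'w') as f:
            json.dump(policy, f)


def load_input_repo(root):
    """Return the policy stored in the repo, or an empty dict if there is none."""
    path = os.path.join(root, POLICY_FILE)
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


policy = {'datasets': {'testDataset': {'storage': 'PickleStorage'}}}

with tempfile.TemporaryDirectory() as root:
    init_output_repo(root, policy)   # 'someone else' creates the output repo
    loaded = load_input_repo(root)   # later, you only need to know the root

with tempfile.TemporaryDirectory() as root:
    init_output_repo(root)           # a repo created without any policy
    empty = load_input_repo(root)
```

The point of the sketch is that the reader of the repository never passes a policy: knowing the root is enough, and a repository without a stored policy behaves as before.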

The new butler API has not been seen much in the wild yet (I don’t really know how to push it out and get people using it), but you can read more about it at https://ldm-463.lsst.io/v/draft/

For other Science Pipelines people wondering how this affects them, I think (confirmation from someone who knows is welcome) the idea is that (Super)Tasks will define their output data products in these per-repository policies rather than in the per-camera policy files we use now (or even the base-class policy file @pgee has been moving definitions to recently). Until that’s possible, I don’t think we have a direct use for this feature.

Don’t we still need a common set of definitions even if the outputs are per-repository? I’m expecting that HSC and DECam data will look basically identical to the analysis scripts.

If not, it’s even more important that @jalt arranges the data sets at NCSA so that all datasets from a given camera share a common root that the butler points at.

I think it’d be really helpful to have a relatively short document describing (or at least mentioning) the features that are in the butler that the ‘classic’ butler didn’t have.

That sounds reasonable. I created https://jira.lsstcorp.org/browse/DM-8080; I’ll probably do this by adding a section to LDM-463.