Naming butler repositories

The Butler now has the concept of a repository index that people can use to look up a memorable label and get back the URI of the relevant butler repository.

For this to work, the user’s site must define the environment variable DAF_BUTLER_REPOSITORY_INDEX to point to the URI (a file system path, an S3 URI, etc.) of a YAML (or JSON) file containing a simple dict mapping of label to URI.

For example:

latiss: "/repo/main"
lsstcam: "/repo/main"
dc22: "/repo/dc2.2i"

You can then do something like:

from lsst.daf.butler import Butler

butler = Butler(Butler.get_repo_uri("latiss"))

and not have to remember where the recommended LATISS butler repository is located. This same code will work wherever someone has defined a default repository location for “latiss”.

You can list all known repos with:

print(Butler.get_known_repos())

None of this works though without us writing those YAML index files.
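
As a rough illustration of what a site would need to set up, the sketch below writes a small index file and resolves a label from it using only the Python standard library. It is not the Butler implementation: the helper names `write_index`, `read_index` and `lookup` are hypothetical, it uses the JSON form of the index to avoid a YAML dependency, and it assumes the environment variable holds a local file path rather than a general URI.

```python
import json
import os
import tempfile

ENV_VAR = "DAF_BUTLER_REPOSITORY_INDEX"


def write_index(path, mapping):
    """Write a label -> URI mapping as JSON (hypothetical helper)."""
    for label, uri in mapping.items():
        if not isinstance(label, str) or not isinstance(uri, str):
            raise TypeError(f"label and URI must both be strings: {label!r}")
    with open(path, "w") as fh:
        json.dump(mapping, fh, indent=2)


def read_index():
    """Load the index named by the environment variable (local paths only)."""
    path = os.environ.get(ENV_VAR)
    if path is None:
        raise RuntimeError(f"{ENV_VAR} is not set")
    with open(path) as fh:
        return json.load(fh)


def lookup(label):
    """Return the URI for a label, listing the known labels on failure."""
    index = read_index()
    if label not in index:
        known = ", ".join(sorted(index))
        raise KeyError(f"unknown label {label!r}; known labels: {known}")
    return index[label]


# Demo: create an index in a temporary directory and resolve a label.
with tempfile.TemporaryDirectory() as tmpdir:
    index_path = os.path.join(tmpdir, "repo_index.json")
    write_index(index_path, {"latiss": "/repo/main", "dc22": "/repo/dc2.2i"})
    os.environ[ENV_VAR] = index_path
    print(lookup("latiss"))
    print(sorted(read_index()))
    del os.environ[ENV_VAR]
```

In real use the file would live somewhere permanent and the environment variable would be set site-wide; the temporary directory here only keeps the sketch self-contained.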

Ideally we’d use the same label for the (conceptually) same set of data everywhere.

On IDF you could imagine labels of “dp01”, “dp02” and “dp0”, with the last of these changing from dp0.1 to dp0.2 when dp0.2 comes out.

At NCSA we have /repo/main as a repository with HSC and LSST data in it, but the summit does not have such a thing. It therefore might make more sense for per-instrument labels to point to the same repository, so that if you want LATISS data you’ll always end up in the best repository for LATISS.

What people choose for these labels is out of my hands but we can discuss on this ticket.

In particular I imagine @hsinfang, @yusra, @jbosch, and @merlin will have opinions on whether the same labels should work at NCSA and summit and how much IDF should conform.

Once we know the names I can make a ticket for each site for someone to implement the creation of the file and the setting of the environment variable.

I do indeed have very strong feelings about whether the same labels should work at NCSA and the summit - the entire point of the RFC and subsequent tickets was to provide a uniform interface for butler instantiation! If you want a different one for each site then you may as well make it the paths you remember rather than these labels :grinning_face_with_smiling_eyes:

Assuming we can all agree on the above, I will say that I don’t have any thoughts on the DPX.Xs, nor am I sufficiently up on them to have opinions, so I will leave that to others tagged here. For the real cameras, I’d propose:

['latiss', 'lsstcam', 'lsstcomcam']

or we could go for case-matching the paths in the repos, so

['LATISS', 'LSSTCam', 'LSSTComCam']

I don’t much care which tbh. I’d lean slightly towards all lower case so you don’t have to remember the capitalisations, but am not really bothered.

On IDF there are currently 2 repos of interest, one for RSP prod environment and one for RSP int environment:

For the RSP on data.lsst.cloud:
dp01: "s3://butler-us-central1-dp01"

For the RSP on data-int.lsst.cloud:
dp02: "s3://butler-us-central1-panda-dev/dc2/butler-external.yaml"
(p.s. This won’t be the eventual DP0.2 repo for delegates in the prod environment; instead a new Postgres will be deployed in RSP’s project and a database copy will be made.)

In either case, the label can also be named dp0 or dc2, with whatever capitalization. It might be nice if the same label is used at NCSA. But I don’t have strong feelings about how the labels should be named. My only opinion is that I hope only repos accessible from the environment show up as the known repos in that environment.

I think that’s a feature of the system. You can define dp02 to be whatever you want and it can change over time and can refer to different things on data.lsst.cloud and data-int.lsst.cloud.

The configuration YAML file is entirely site-specific so it only includes what you want to include. You can have generic keys and specific keys. Things like dc2 and latiss could point to whatever people think is the most general place for dc2 or latiss data at that specific site. You can of course define dp0 and dc2 and dp02 all to point to the same place but have dp01 be different (but it used to be the same place as dp0).

The main point here is that you can have many-to-one mappings of names to butler URIs and we want general names and specific names. The users can then decide whether they really really need dp02 or whether they only care about having access to some dc2 data. All the teaching notebooks for delegates will likely want to be targeting a general dp0 butler so that they work when dp01 is replaced by dp02.
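
To make the many-to-one idea concrete, here is a small sketch that inverts an index so you can see which labels currently share a repository. The function `group_aliases` is a hypothetical helper, not part of the Butler API, and the assignment of general labels to URIs is only illustrative; the two URIs are the ones quoted earlier in this thread.

```python
from collections import defaultdict


def group_aliases(index):
    """Invert a label -> URI index into URI -> sorted list of labels."""
    groups = defaultdict(list)
    for label, uri in index.items():
        groups[uri].append(label)
    return {uri: sorted(labels) for uri, labels in groups.items()}


DP01 = "s3://butler-us-central1-dp01"
DP02 = "s3://butler-us-central1-panda-dev/dc2/butler-external.yaml"

# Illustrative index: the general labels currently follow dp01.
index = {"dp0": DP01, "dc2": DP01, "dp01": DP01, "dp02": DP02}
before = group_aliases(index)

# When dp02 becomes the default, only the general labels are repointed;
# the specific labels dp01 and dp02 keep their meaning.
index.update({"dp0": DP02, "dc2": DP02})
after = group_aliases(index)

print(before[DP01])
print(after[DP02])
```

A notebook that asks for dp0 picks up the change without edits, while one that asks for dp01 keeps getting the older repository.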

This has now been implemented, and since w_2022_10 you can use labels such as dp02 in place of a butler YAML file in the Butler constructor and from command-line tools.