Implementing RFC 95 / Populating /datasets

I don’t think we want completely independent datasets living under the data repo — there’s too much potential for confusion. Which directories in the data repo are part of the data repo and which are independent? Hence I suggest we keep everything separate:

  • Data root repo: /datasets/decam/repo (or similar); will contain /datasets/decam/repo/rerun for processed results.
  • Raw data (uningested into repo): /datasets/decam/raw
  • Preprocessed data: /datasets/decam/preprocessed (or .../cp for Community Pipeline?)

Perhaps we don’t understand each other because I’m assuming that we want to keep the raw data independent of the data repo and not just ingest it into it. I think that’s desirable because the raw data is sacrosanct, while the data repo can be allowed to evolve.

I don’t think that those fields are guaranteed to be present as there’s no standard for the registry schema. The HSC schema has lots of fields – but isn’t that one reason why the sqlite3 files are so large, and @price didn’t want to ingest all our HSC data into one registry?

I think that is a different starting assumption - isn’t the butler designed to be able to accept raw data as-is? Why wouldn’t we want to use that feature, especially if not using it means twice as much storage for raw data?

As for preprocessed, I think we agree that they need to be separate repos right now. But I’d like to leave the door open for having the butler/registry manage different versions of preprocessed datasets within the main repo in the future, and I think that’s easier if we make it a subdirectory.

Perhaps, but we never use it like that. Instead, raw data is always ingested into the data repo (laid out in the directory structure specified by the policy, with the registry populated). That’s how we do HSC and that’s how we’ve been doing DECam. Given that how the data in the repository is laid out may change and we’ll want to re-ingest, I believe it’s best to keep the raw data and the data repo separate.

Ingestion will usually link to the raw data, rather than copy.

@price, you’ve addressed my concerns with raw data vs. repos.

What about preprocessed?

Do you agree with my assertion that we could easily move rerun out of the main repo if that makes it easier to maintain permissions?

Agree with putting all reference catalogs under one folder. Would we want to separate the astrometry_net_data style and the new reference loader style into two subfolders? Something like this:

  • /datasets/refcats/astrometry_net_data/ (sources: /lsst7/astrometry_net_data/ ) owner: @price
  • /datasets/refcats/cal_ref_cat/gaia_DR1_v1/ (source: /gpfs/fs0/home/ctslater/gaia_refcat ) owner: @ctslater

Do I recall correctly that there are plans to convert those 2MASS and SDSS astrometry_net_data catalogs into the new reference loader style? If that happens, would they then go side-by-side with the GAIA catalog?

It should be in a separate directory, not under the data repo. I believe we’ve been trying to move away from caring about CP outputs, so I wonder if it’s necessary.

lsst.pipe.base.ArgumentParser is hardwired to put the rerun under the main repo, but I guess you could put a link.

I can see grouping the astrometry_net_data style catalogs together, since we’re trying to wean ourselves off it. Once we have modern replacements, we could just delete all the astrometry_net_data style catalogs at once. But once that’s gone, there’s only one catalog style left, so I don’t see the point in an extra layer of indirection then.

Sounds reasonable. So my corrected proposal is:

  • /datasets/refcats/astrometry_net_data/ (sources: /lsst7/astrometry_net_data/ ) owner: @price
  • /datasets/refcats/gaia_DR1_v1/ (source: /gpfs/fs0/home/ctslater/gaia_refcat ) owner: @ctslater
  • create a symlink /datasets/refcats/gaia_latest that points to /datasets/refcats/gaia_DR1_v1/

My attempt to summarize the top level structure for each camera:

/datasets/<camera>/repo/ (This is where the Butler root is located. It has _mapper, registry.sqlite3, links to the actual data, etc.)
/datasets/<camera>/repo/rerun/ (Processed results will go here later.)
/datasets/<camera>/preprocessed/ (For example /datasets/sdss/preprocessed/dr9/, /datasets/decam/preprocessed/cp/, etc.)
/datasets/<camera>/raw/<survey-name>/<actual raw data> (This is where the actual files live, uningested; they don’t change when Butler templates change.)

Does this sound good to everybody?
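
For concreteness, here’s a minimal sketch of what using this layout might look like from Python (the dataId values are made up, and I’m assuming the lsst.daf.persistence Butler interface):

    from lsst.daf.persistence import Butler

    # Point the butler at the camera's repo root (contains _mapper,
    # registry.sqlite3, and links back to /datasets/<camera>/raw/...).
    butler = Butler('/datasets/decam/repo')

    # Retrieve a raw exposure; the visit/ccdnum values here are placeholders.
    raw = butler.get('raw', visit=123456, ccdnum=10)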

In terms of HSC data, this means the copying will be:

  • /datasets/hsc/raw/commissioning/ (source: /lsst3/HSC/ minus price_test/ and data/, owner: @price )
  • /datasets/hsc/raw/newhorizons/ (source: /lsst8/newhorizons, owner: @ctslater )
  • All above raw data will be Butler-ingested into /datasets/hsc/repo/

This is a bit different from the original RFC-95 in the sense that the Butler root will be /datasets/<camera>/repo/ instead of /datasets/<camera>/. But that avoids potential naming conflicts. (Also, it may make reingesting easier in case we change the Butler templates…)

The guideline is to have a single Butler data repository for all data from a camera when possible. I don’t think DecamMapper can handle both raw and CP data in one Butler repo yet, so CP data will likely form its own repo inside preprocessed/ until DecamMapper is refactored.

Sounds like we have reached consensus. Jira story to track the data movement: DM-7985. It is not clear to me how to hand off a task in Jira; we can copy the data, but someone needs to butler-ize it and sign off before we make it immutable. In an operations environment, we would just create the ticket and assign it, but I’m not sure how that flies with earned value.

Also, I foresee an RFC that is a successor to RFC 95 with these conclusions plus any policy. Upon acceptance, I assume that becomes documentation for developer.lsst.io.

We forgot to talk about calibration data. I imagine all calibration data could live under /datasets/<camera>/calib/, and there can be multiple calibration repositories in it. A processing run can then pick whichever one it needs using the --calib command-line task argument. Does that sound right?

So, maybe something like this?
/datasets/hsc/calib/commissioning/ (source: /lsst3/HSC/CALIB/ )
/datasets/hsc/calib/newhorizons/ (source: /lsst8/ctslater/nh_data/CALIB/ )

I prefer having one authoritative calib set to go with the one data repo. That’s what we do for HSC. That way, you don’t have to worry about choosing the right calibration set, because the default set covers everything. But we also want somewhere to put alternate sets. So I propose:

  • /datasets/hsc/calib/default/
  • /datasets/hsc/calib/test-20161025

etc., and point /datasets/hsc/repo/CALIB --> /datasets/hsc/calib/default so you don’t have to use --calib on the command-line.
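
A rough sketch of what I mean, using the paths above (assuming a plain filesystem symlink is acceptable here):

    import os

    # Make the authoritative calib set the one the butler finds automatically,
    # so --calib isn't needed for routine processing.
    os.symlink('/datasets/hsc/calib/default', '/datasets/hsc/repo/CALIB')

    # An alternate set can still be chosen explicitly for a particular run, e.g.:
    #   processCcd.py /datasets/hsc/repo --calib /datasets/hsc/calib/test-20161025 ...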

Craig Loomis raised the same question here. While you can put a symbolic link in, I think it’d be better to remove the hard-coded assumption about the location of the rerun directory (you’ll often want it on a different disk). Of course, you may also want to make rerun/rhl a symbolic link too, so this doesn’t totally solve the problem.

I don’t think this applies, for a couple of reasons:

  • we want a hard-coded assumption about the location of rerun directories for the sake of organization
  • in this environment, you cannot target separate resources

I guess I was thinking that it’d be part of @natepease’s butler metadata. For any given dataset I totally agree that it should be fixed. But this isn’t something I deeply care about.

I think the relationship between the repositories will get captured in butler metadata either way; whether the repositories exist in the filesystem in a hierarchy or not should not matter to the butler.

Some tl;dr about the new butler, parent repositories, and multiple input & output repositories:

In the new butler there is a repository configuration that lives at the root of the repo. It’s a yaml file. One of the things it contains is a list of parent repositories (in the form of a list of URIs pointing to the parent repositories’ repository config files). The parent repositories get added to the butler’s search path. So if you have 2 repositories A and B, where A is the parent of B, and you init a butler with B as an input repository, A will get added to the butler’s input repositories after B. (And if A has parents listed in its config, those will get added after A.) When you call butler.get, the butler will look in B, and if the item is not found it will then look in A.

Or, you can have 2 repositories X and Y that have no implicit relationship, and init a butler with both of them as inputs: Butler(inputs=(pathToX, pathToY)). In this case, Butler.get will search X, and if the item is not found it will then search Y.
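
A quick sketch of both cases (the paths, the dataId, and the calexp dataset type are just placeholders for illustration):

    from lsst.daf.persistence import Butler

    # Case 1: B's repository config lists A as a parent, so giving only B as an
    # input puts both on the search path, B first.
    butler = Butler(inputs='/path/to/B')
    calexp = butler.get('calexp', visit=100, ccd=5)   # looked up in B, then in A

    # Case 2: X and Y have no recorded relationship; list both explicitly and
    # the search order follows the order given.
    butler = Butler(inputs=('/path/to/X', '/path/to/Y'))
    calexp = butler.get('calexp', visit=100, ccd=5)   # looked up in X, then in Y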

In order to make sure that the layout of the LSST data (and its precursor data) is friendly to other sites where LSST software may eventually run, including but not limited to CC-IN2P3, I would like to propose that the datasets hierarchy be relative to a site-specific top-level directory, for instance pointed to by a $LSST_DATA environment variable. The default value for this variable may be /datasets if you so desire.

The motivation for this is that in a computing facility shared by several research programs (such as CC-IN2P3) it would be extremely difficult to make sure that LSST data will be under the absolute path /datasets/....

Specifically, currently at CC-IN2P3 we put all LSST-related data under the absolute path /sps/lsst. I could therefore initialise LSST_DATA=/sps/lsst/datasets.
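
For example, site-specific scripts or configuration could resolve the root roughly like this (just a sketch; the variable name is only my suggestion):

    import os

    # Resolve the site-specific data root, falling back to /datasets.
    data_root = os.environ.get('LSST_DATA', '/datasets')

    # At CC-IN2P3 this resolves to /sps/lsst/datasets/hsc/repo;
    # at NCSA it stays /datasets/hsc/repo.
    hsc_repo = os.path.join(data_root, 'hsc', 'repo')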

My concern is that absolute paths get hardcoded deep into the software and that makes life really difficult for people like us operating computing facilities used by several research programs.

Rest assured that we are here discussing only where to put data on a specific piece of hardware. This path will not and cannot be hard-coded, because any code we write will also have to work on hardware at Princeton, UW, people’s laptops, IN2P3, our Jenkins cloud, etc.


I definitely agree that absolute pathnames should not be hardcoded. However, it can be extremely useful to be able to relocate relative pathnames from site to site and not just assume that the entire filesystem organization is site-specific.

We could plan on this kind of relocatability at the filesystem organization level even while continuing to insist that no pathnames of any kind appear in the application code itself.