Implementing RFC 95 / Populating /datasets

With /datasets now available, DR in the works, the lsst-dev7 transition to full ops coming and NFS retirement in the near future, it is time to finalize the plan to organize camera data for project use, to reduce the new-developer ‘discovery’ curve, encourage / enable use of new capabilities, and all that good stuff.

Background information via RFC 95.

Layout for initial loading

With the help of @hsinfang and @daues, we have identified these candidate datasets for copying into /datasets as defined below. Note, we are proposing ‘copying’ data at this time, not moving, in order to not disrupt current development efforts.

We ask that the owner (or delegate, domain expert, manager, whatever) confirm the destination. NCSA is available for the data copy. Each set will require someone with proper domain knowledge for its butler-ization.

(If the coordination here becomes too unwieldy, we will move to Jira.)

/datasets/astrometry_net_data/ (source: /lsst7/astrometry_net_data/, owner: @price)
/datasets/decam/data/ (source: /lsst8/decam, owner: @mwv)
/datasets/hsc/commissioning/ (source: /lsst3/HSC/, owner: @price)
/datasets/hsc/newhorizons/ (source: /lsst8/ctslater/nh_data minus _parent, rerun, owner: @ctslater)
/datasets/sdss/preprocessed/dr9/ (source: /lsst7/sdss/dr9/, owner: @mjuric)
/datasets/lsstSim (source: external)
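
For whoever ends up running the copy, the list above can be written down as a source-to-destination map and turned into commands that can be reviewed before anything is touched. This is only a minimal sketch: it assumes rsync is the copy tool (the thread does not say what NCSA will use), it covers only some of the entries, and the names COPY_PLAN and rsync_command are made up for the illustration.

```python
# Hypothetical copy plan: (source, destination, names to exclude).
# Paths come from the list above; rsync is an assumed tool choice.
# A trailing slash on the source tells rsync to copy the directory contents.
COPY_PLAN = [
    ("/lsst7/astrometry_net_data/", "/datasets/astrometry_net_data/", []),
    ("/lsst3/HSC/", "/datasets/hsc/commissioning/", []),
    ("/lsst8/ctslater/nh_data/", "/datasets/hsc/newhorizons/", ["_parent", "rerun"]),
    ("/lsst7/sdss/dr9/", "/datasets/sdss/preprocessed/dr9/", []),
]


def rsync_command(source, destination, excludes, dry_run=True):
    """Build an rsync argument list that copies (never moves) one dataset."""
    command = ["rsync", "-a"]
    if dry_run:
        # Report what would be copied without changing anything.
        command.append("--dry-run")
    for name in excludes:
        command.append(f"--exclude={name}")
    command += [source, destination]
    return command


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for source, destination, excludes in COPY_PLAN:
        print(" ".join(rsync_command(source, destination, excludes)))
```

Because the source is only read, current development on the old paths is not disturbed, which matches the copy-not-move intent stated above.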

These two are not verification data but I would like to make them equally accessible.
/datasets/all-sky (source: /lsst/all-sky-ASIVA, owner: Mike Fitzgerald)
/datasets/all-sky-ASIVA (source: /lsst/all-sky-ASIVA, owner: Jacques Sebag)

These two snuck in the backdoor of the new home. If this is verification data, please suggest a destination.
/datasets/gaia/ (source: /gpfs/fs0/home/ctslater/gaia_refcat, owner: @ctslater)
/datasets/ (source: /gpfs/fs0/home/fforster/???)

On Immutability

I have a plan for how we can safeguard and even verify the integrity of these datasets, but that will be revealed in another thread. For this initial loading, we will secure the data by hand.
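
The plan mentioned above has not been posted, so the following is not it. Purely as an illustration of what verifying integrity could look like, here is a minimal sketch that records a SHA-256 checksum per file and later reports files that have changed or disappeared. The function names are invented for this example, and nothing here is an agreed procedure.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_sha256(full_path)
    return manifest


def changed_files(root, manifest):
    """Return sorted relative paths that are missing or whose checksum differs."""
    current = build_manifest(root)
    return sorted(
        path for path, checksum in manifest.items()
        if current.get(path) != checksum
    )
```

A manifest built right after the copy could be stored outside the dataset directory and compared against periodically. Files added later are not reported by this sketch; it only checks what was in the original manifest.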

Creation / Deletion Policy

RFC-95 touched on the need for an RFC when removing ‘public’ data sets. I believe we need a formal procedure for introducing data sets as well as their layouts. Again, that is coming, likely via RFC.

This was a placeholder for the HiTS survey data of @ctslater and Francisco (he doesn’t seem to have an account here or on Slack yet?).
How about /datasets/decam/hits/ for those data?

I suggest we create /datasets/refcats, and copy:

  • /lsst7/astrometry_net_data/* --> /datasets/refcats/
  • /gpfs/fs0/home/ctslater/gaia_refcat --> /datasets/refcats/gaia

The HSC move is good, thanks.


I agree with both @hsinfang and @price’s suggestions. I would also add that we will probably want to version the Gaia catalogs. I propose calling this one gaia_DR1_v1 and creating a symlink gaia_latest which points to it.
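
The versioned-directory-plus-symlink idea could be set up along these lines. This is a minimal sketch, not an agreed procedure: point_latest is a made-up helper, and the parent directory in the usage note is just the /datasets/refcats location suggested earlier in the thread.

```python
import os


def point_latest(parent_dir, versioned_name, link_name="gaia_latest"):
    """Create or replace a relative symlink link_name -> versioned_name."""
    target = os.path.join(parent_dir, versioned_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"no such catalog directory: {target}")
    link_path = os.path.join(parent_dir, link_name)
    temp_path = link_path + ".tmp"
    if os.path.lexists(temp_path):
        os.remove(temp_path)
    # The link target is relative, so it survives a move of the parent directory.
    os.symlink(versioned_name, temp_path)
    # Renaming over the old link swaps it in a single step on POSIX file systems.
    os.replace(temp_path, link_path)
    return link_path
```

For example, point_latest("/datasets/refcats", "gaia_DR1_v1") would leave gaia_latest pointing at gaia_DR1_v1. When a newer version is loaded, calling it again with the new directory name repoints the link without touching the older catalog.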


I note that hsc data is split up by survey/proposal, and it seems decam data is not (or if it is, that’s hidden beneath a superfluous “data” subdirectory). Should we be consistent across cameras about this?

I’d generally say that having a single data repository for all data from a camera is preferable, with the only reason not to do that being data access permission issues. And if we do have datasets that are at all proprietary, perhaps those should be in a special “proprietary” subdirectory for every camera, so everything else can go in one repo?

I don’t much care how the data is laid out on disk, but I do care about where the butler points! I can’t see the /datasets directory on lsst-dev (and can’t login to lsst-dev7), so I cannot check, but I’d like to have only one root for all the data from a given camera.

I know that @price has pushed back on this for HSC due to the size of the registries, but that’s a problem that we need to solve.

The HSC data, at least, is just a dump of raw FITS files, and not (yet?) a proper butler-ized data repo.

That’s fine; as I said, I don’t much care where the data goes, but I wanted to provide guidance for when NCSA starts thinking about butlers (which will presumably be next/soon).

Everyone should have access to lsst-dev7. If you do not, send email to lsst-sysadm at ncsa illinois edu.

How do you keep track of individual programs? I’m not opposed to this, but I don’t know how I would keep track of what data is (e.g.) new horizons that I want to process vs everything else.

I believe the HSC registry has a “proposal” column that can be used to select different programs, though now that I look at the actual values in it I’m not sure I understood it correctly (@price, do you?). It also has a “field” column that’s used for fields within large programs. That’s also used in the organization of the data on disk, so even if you aren’t using the butler it’s relatively easy to find all the data in a particular area of the sky in a given program.

I think it’s terribly under-appreciated that you can specify data like:

  • --id field=M123 filter=r
  • --id proposal=12345 filter=g
  • --id dateObs=2012-03-04 filter=z

instead of just providing a list of visits.
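
To make the idea concrete: a key=value selection like the ones above amounts to filtering registry rows on the named columns. The sketch below is not the butler's code. It builds a toy in-memory SQLite table whose name (raw) and columns are assumptions for the demo, and parse_id, select_visits, and demo_registry are made-up helpers.

```python
import sqlite3

# Columns this toy selection is allowed to use (assumed, not the real schema).
ALLOWED_KEYS = ("field", "filter", "proposal", "dateObs")


def parse_id(text):
    """Turn 'field=M123 filter=r' into {'field': 'M123', 'filter': 'r'}."""
    selection = {}
    for token in text.split():
        key, _, value = token.partition("=")
        if key not in ALLOWED_KEYS or not value:
            raise ValueError(f"unsupported selector: {token!r}")
        selection[key] = value
    return selection


def select_visits(connection, selection):
    """Return the sorted distinct visits whose columns match every selector."""
    # Keys are checked against ALLOWED_KEYS in parse_id; values are bound as parameters.
    where = " AND ".join(f'"{key}" = ?' for key in selection)
    query = "SELECT DISTINCT visit FROM raw"
    if where:
        query += " WHERE " + where
    query += " ORDER BY visit"
    rows = connection.execute(query, list(selection.values()))
    return [row[0] for row in rows]


def demo_registry():
    """Build a tiny in-memory registry with invented rows."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        'CREATE TABLE raw (visit INTEGER, field TEXT, "filter" TEXT, '
        "proposal TEXT, dateObs TEXT)"
    )
    connection.executemany(
        "INSERT INTO raw VALUES (?, ?, ?, ?, ?)",
        [
            (1, "M123", "r", "12345", "2012-03-04"),
            (2, "M123", "g", "12345", "2012-03-04"),
            (3, "OTHER", "r", "99999", "2012-03-05"),
        ],
    )
    return connection
```

With the invented rows above, select_visits(demo_registry(), parse_id("field=M123 filter=r")) picks out only visit 1, so the person processing one program never has to maintain a visit list by hand.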

Looks like those fields are not currently supported with Decam. I have filed DM-8069 to fix that.

See also DM-5883.

Of course, that ticket has been sitting there for a while because obs_decam has no-one to take responsibility for it.

Let me know if I can help shed light on any butler-use questions or issues.

I think the superfluous data subdirectory is our invention. It was unclear to me, in RFC-95, whether ‘real’ data went in the top-level directory or within a data subdirectory. For the purpose of dataset discovery, I think a subdirectory for ‘real’ data makes it much easier for users to ‘discover’ the other surveys within the root directory. Does this break something?

What were you imagining for the other subdirectories in /datasets/decam/ besides data? I was thinking that all we’d have in these directories was real data, though that’s by no means the only sensible arrangement. In any case, my primary concern is that we be consistent across cameras unless we have a reason not to.

It looks like we would have:

/datasets/decam/<real data>
/datasets/decam/preprocessed/
/datasets/decam/rerun/

I wonder if there is a relationship between these directories that requires preprocessed/rerun to be in the <real data> directory? If so, then making a ‘data’ subdirectory does not make any sense.

I don’t think we want the raw data under a subdirectory of the butlerised data repo, as there’s the potential for name conflicts.

Option 1:

  • Raw data in /datasets/decam/_raw/
  • Repo in /datasets/decam/

Option 2:

  • Raw data in /datasets/decam/raw/
  • Repo in /datasets/decam/data/

I prefer option 2, as the _raw may be viewed as an implementation detail and not treated as a first-class directory with important data.

Are there other sensible options?

I was envisioning:

  • Raw data root repo: /datasets/decam
  • Preprocessed data in /datasets/decam/preprocessed

The preprocessed data would probably contain subdirectories with completely independent repos inside it, at least unless (until?) the butler can handle different versions of preprocessing applied to the same data.

I was further imagining that we’d put the rerun directory within the raw data root as a subdirectory, but there’s no reason it has to go there, and I could certainly understand that it’d make permissions easier if it’s a sibling directory of the raw data repo rather than a subdirectory. The same probably applies for the calib repos.