Shared Gen3 data repositories ready for (some) use

The long-term shared Gen3 data repositories at NCSA described by RFC-741 and DMTN-167 are now up and ready for users (at least I think so, and it’s time to test that assertion).

They are not yet complete, but there is enough there already that at least some users should be able to switch their daily work to these data repositories. I also have plenty of documentation work to do before we can declare RFC-741 done, but I figure it’s better to open these up now than to block usage on documentation.

All of these repos have u/<user> subdirectories (corresponding to u/<user> collections) that users can write to, with other directories read-only for everyone except me and (in one case) @madamow. I can open up other directories to others as we go; I certainly expect to for e.g. calibrations. And hopefully I’ll eventually have time to put some database-side guardrails on database-only entities; for now, just continue to be careful to delete or modify only collections that start with u/<user>.
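Until those guardrails exist, a small check in your own scripts can help before anything destructive runs. Here is a minimal sketch, assuming collection names are plain strings; `is_own_collection` is a hypothetical helper for illustration, not part of the butler:

```python
def is_own_collection(collection: str, user: str) -> bool:
    """Return True only if `collection` lives under the u/<user> namespace."""
    prefix = f"u/{user}"
    # Require an exact match or a "/" after the prefix, so that
    # u/alice2/... is not mistaken for one of alice's collections.
    return collection == prefix or collection.startswith(prefix + "/")


if __name__ == "__main__":
    # Hypothetical collection names, used only to exercise the check.
    for name in ("u/alice/rc2-test", "u/alice2/rc2-test", "HSC/calib"):
        print(name, is_own_collection(name, "alice"))
```

Calling this before any removal or modification in a script makes the "only touch u/<user>" rule explicit rather than something to remember.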

The data repositories and their current status and contents are outlined below, but please feel free to poke around; I highly recommend

$ butler query-collections /repo/main --chains=tree | less

/repo/main

This data repository will hold essentially all data from all real instruments (the only known exception is what’s in /repo/ccso).

Right now it has essentially all of the HSC data at NCSA - PDR2 and a few special programs (a small amount of calibration data failed to ingest for reasons that have been diagnosed but not fixed). There are two suites of master calibrations (HSC/calib/gen2/20180117 and HSC/calib/gen2/20200115), with the former marked as the default (it was in Gen2) and usable via just HSC/calib (aside: should the later calibration suite actually be the default?).

There are also special collections for the heavily used RC2 subset; using HSC/RC2/defaults as your input collection should cover everything you need for regular DRP processing, and if you’re processing the whole thing, it should make passing a visit constraint on the command-line unnecessary.

The converted w_2021_02 and w_2021_06 Gen2 RC2 runs are present as well - I’ll do w_2021_10 shortly, now that (I think) it’s done.

Eventually /repo/main will include data from LSST hardware and DECam as well; the former will be ingested after the filesystem reorganization scheduled for Thursday morning, and I’ll get started on ingesting DECam data later this week as well.

/repo/dc2

This includes all of the DESC DC2 DR6 WFD raws and processing that will be used in DP0.1 (it’s a clone of the IDF repo - or rather the converse). That includes the original calibs used for the DESC processing. The dataset processed approximately monthly by DM is a subset of this, and there are two TAGGED collections that contain the raws for these important subsets:

  • 2.2i/raw/DP0: raws for DP0 (currently everything, but this will stay the same even if we add more raws)
  • 2.2i/raw/test-med-1: raws for the DM monthly processing subset

The full DESC-run DR6 WFD processing can be found in the 2.2i/runs/DP0.1 collection.

/repo/teststand

This contains raw data from the simulated NCSA teststands.

/repo/ccso

This is where alternate versions of LSST raws (written by the CCS with controller=‘O’) will land. At least I think that’s what it is; I’m confident that the people who care about this repository already know more about its contents than I do. So far it just contains instrument registrations.

Are the /u/ repos protected from each other? Or can one user in principle wipe out someone else’s?

The u/ files and directories should be protected from each other, and that should prevent most accidental deletion of database-only per-user things, because we’ll roll back database transactions when the associated filesystem deletions fail. But it will still be possible to delete or modify other people’s database content in rare cases (mostly CHAINED, TAGGED, and CALIBRATION collections, which have no direct filesystem counterparts). And direct SQL write operations are very dangerous (and very much disallowed at a policy level), because those don’t have the filesystem-transaction-consistency guardrails.
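To illustrate the roll-back-on-failed-deletion idea, here is a minimal sketch of the general pattern; it is not the butler's actual code. It assumes an in-memory SQLite database as a stand-in, a hypothetical `dataset` table, and a file that is missing on disk as a stand-in for a deletion that fails on permissions:

```python
import os
import sqlite3
import tempfile


def remove_dataset(conn: sqlite3.Connection, path: str) -> bool:
    """Delete a dataset's database row and its file together, or neither."""
    try:
        # "with conn" commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute("DELETE FROM dataset WHERE path = ?", (path,))
            # Raises OSError if the file is missing or we lack permission.
            os.remove(path)
    except OSError:
        return False  # the row deletion above has been rolled back
    return True


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]


def demo():
    """Return the row counts after a failed and then a successful deletion."""
    with tempfile.TemporaryDirectory() as root:
        mine = os.path.join(root, "mine.fits")
        missing = os.path.join(root, "missing.fits")  # never created on disk
        open(mine, "w").close()

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE dataset (path TEXT PRIMARY KEY)")
        conn.executemany("INSERT INTO dataset VALUES (?)", [(mine,), (missing,)])
        conn.commit()

        remove_dataset(conn, missing)  # file removal fails, so the row survives
        after_failure = count_rows(conn)
        remove_dataset(conn, mine)  # both the file and the row go away
        after_success = count_rows(conn)
        conn.close()
        return after_failure, after_success


if __name__ == "__main__":
    print(demo())
```

The sketch also shows why entities with no filesystem counterpart are not covered: with no file removal to fail, nothing triggers the rollback.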


Hi,

I am working in the UK and don’t think I can get an NCSA account. Is there any plan to make the gen3 HSC data public or at least a small subset for testing? I might be able to ask for a special concession to get an NCSA account but I really just want to start looking at the Butler to prepare for using it with our own obs package.

Best,

Raphael.

We do not yet, to my knowledge, have a package containing HSC data in Gen3 form. But there are (at least) two ways of converting some of the Gen2 HSC data we publish into Gen3.

  1. Use validation_data_hsc and the conversion commands from the faro package.
  2. Use testdata_ci_hsc and the conversion done by the butler, instrument, curatedCalibrations, skymap, external, raws targets in the ci_hsc_gen3 SConstruct file.