Recreating the LSST Science Pipelines tutorial (gen2) using only Generation 3 command-line tasks and pipetasks

Hi, I have redone the timing tests for local storage and ECHO storage for the LSST pipeline tutorial, using Generation 3 commands with version w_2021_30 of the stack. I have also tested two configurations of CephFS: one with the registry inside CephFS and the other with the registry in local storage.

The composite disassembly run was 38.4 GB.
The test run without composite disassembly was 12.4 GB.
Was this increase in data size expected?
(processes = 1 on the pipetasks)

Thanks for doing this. I am confused by some of the results.

For example, write-curated-calibrations should be no different with or without composite disassembly, because none of the curated calibrations are disassembled, and yet somehow it’s 10% slower. I may well have to implement a butler.put that allows multiple datasets to be stored at once so that the datastore can parallelize uploads.

Coaddition is going to be slow because each component is downloaded separately from S3 and then combined into a single Exposure. It’s good to see that there is no slowdown in any of the runs with disassembly that used a “local” filesystem. I think I may have to implement an asyncio parallel file retrieval option (and storage option) so that we can be transferring these files to and from S3 simultaneously. It is interesting how make-discrete-skymap is barely any faster despite downloading significantly less data.
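
For illustration, a minimal sketch of the kind of asyncio-based parallel retrieval being considered, assuming a plain boto3 client and a hypothetical bucket and keys (this is not the actual datastore code):

import asyncio

import boto3

s3 = boto3.client("s3")  # one shared client, reused for every request

async def fetch(key, dest):
    # boto3 calls block, so run each download in a worker thread
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, s3.download_file, "my-bucket", key, dest)
    return dest

async def fetch_components(keys):
    # retrieve all components of a disassembled Exposure concurrently
    return await asyncio.gather(*(fetch(k, k.split("/")[-1]) for k in keys))

# e.g. asyncio.run(fetch_components(
#     ["exp/image.fits", "exp/mask.fits", "exp/variance.fits"]))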

Lots of separate FITS files are always going to be bigger than one file, although I wasn’t expecting a factor of 3.

Rewind, reset on V22. Nate, I have just completed my second run-through of the tutorial for V22. I am now planning to repeat the process for V22 with gen3. As I understand it, I need to reinstall the V22 pipeline software and THEN apply a patch to incorporate the latest gen3 features. Is this correct? Also, are there tutorials available to guide us through this new process? Please advise. Many thanks, Fred, Dallas, TX

No. You can use gen3 in v22. It’s two months out of date but it’s still possible to use it. If you are serious about gen3, though, you should install a recent weekly. You don’t need to patch anything; instead of installing v22_0_1 you would install w_2021_31 (or whatever the newest is). The one caveat is that you would need a different conda environment, so when you do the newinstall step you need to use the newinstall.sh corresponding to the weekly you want to install (rather than using the v22 tag, use the w.2021.31 tag).

Okay, thanks Tim. I’ll explore where I can find the corresponding tutorial steps so I can employ the latest gen3 pipeline steps. Getting there…

Tim, I ran the latest weekly version of newinstall.sh:
bash ./lsst-w.2021.31/scripts/newinstall.sh -ct
I installed the latest weekly LSST software:
eups distrib install -t w_2021_31 lsst_distrib
I have a reference to an HTML file that I believe is intended to guide me through the latest gen3 tutorial:
Gen3run-V22.html
Can you confirm that the tutorial HTML doc [above] is what I should follow?
Many thanks, Fred, Dallas, TX

That is an “unofficial” tutorial from @joshuakitenge but should be fine as a first step. There are also all the DP0.1 tutorials that I believe you’ve already located, in particular notebook 4.

The replacements for the gen2 tutorials at pipelines.lsst.io are still being worked on but vacations slow things down a bit. We are also working on documentation for how to construct pipelines.

Good deal, Tim. I’m going to work with notebook 4, as I enjoy using Jupyter notebooks, since I can run a few lines, look at some objects, then repeat.
Thanks, Fred, Dallas, TX

Basic question. For this tutorial (see below), is there a place where I can snag it as a .ipynb Jupyter notebook?

Gen3run-V22.html

Here it is:
Gen3 run-v22.ipynb (680.7 KB)

Today I merged a fix for some of the slowdown with composite datasets and S3. It turns out that we weren’t caching the boto3 client, so every single get was taking far longer than it should have. The fix should be in the weekly coming out tonight.
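
The pattern behind that kind of fix looks roughly like this (a sketch, not the actual daf_butler code): construct the client once and reuse it, instead of rebuilding it on every get.

import functools

import boto3

@functools.lru_cache()
def s3_client():
    # building a boto3 client does credential and endpoint resolution,
    # which is expensive; do it once and reuse the result
    return boto3.client("s3")

def get_object_bytes(bucket, key):
    # every get now reuses the cached client instead of creating a new one
    return s3_client().get_object(Bucket=bucket, Key=key)["Body"].read()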

I’m continuing with my gen3 execution of the pipeline processes.
I’m following the steps in Gen3run-v22.html. I’ve successfully registered the HSC instrument:
butler register-instrument GEN3_run/ lsst.obs.subaru.HyperSuprimeCam
Now I’m attempting the butler import, which fails with many errors:
butler import GEN3_run/ ~/lsst_stack/DATA_gen3/ --export-file exports.yaml
I believe I may need to “hack” a conversion from the gen2 to the gen3 repository. I have reincarnated the gen2 repository per the original tutorial.
I need an example of the butler convert command. I’ve fabricated this:
butler convert --gen2root /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc /Users/fredklich/Downloads/lsst_stack/GEN3_run
I’m not clear on these points:
Do we need to reincarnate JUST the reference catalogs from gen2, or
both the gen2 HSC repository data and the reference catalogs?
Can you provide an example of the above butler convert command?
Thanks, Fred, Dallas, TX

You can find how to use the butler convert command in this document.

Thanks, Joshua. I DID look there, but did not realize that some of the commands actually do have EXAMPLES. Great.

Okay, Joshua, I did make significant progress. Running the convert, after 5 minutes or so (seemingly at the very end), I get these messages:
convertRepo INFO: Defining HSC/defaults from chain ['HSC/raw/all', 'refcats', 'HSC/calib', 'skymaps', 'HSC/masks'].
Ends with this:
lsst.daf.butler.registry._exceptions.MissingCollectionError: No collection with name 'HSC/masks' found.
My command is:
butler convert --gen2root /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc/DATA --calibs /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc/DATA/CALIB /Users/fredklich/Downloads/lsst_stack/GEN3_run

I’m doing my best to fill in some of the gaps, but I’m not sure I can discern a way to continue on with gen3.

You should still be able to export and import the refcats from the converted repo you just created, to continue with the workflow from GEN3_run-v22; the guide for doing this is in the tutorial that I posted before. The fact that HSC/masks wasn’t found shouldn’t affect refcats.
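
In case it helps, a minimal sketch of that export step in Python, assuming the converted repo is GEN3_run and that the converted PS1 refcat kept its usual dataset type name (confirm it with butler query-dataset-types GEN3_run):

import lsst.daf.butler as dafButler

butler = dafButler.Butler("GEN3_run")
with butler.export(filename="export.yaml") as export:
    # "ps1_pv3_3pi_20170110" is an assumption; check the name in your repo
    export.saveDatasets(butler.registry.queryDatasets("ps1_pv3_3pi_20170110",
                                                      collections="refcats"))
    # also record the refcats CHAINED collection so the import recreates it
    export.saveCollection("refcats")

The resulting export.yaml is then what the --export-file option of the butler import command should point at.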

Thanks, Joshua. Assuming my convert is good to go, I have attempted to move forward [FINALLY] with the butler import command. This fails with FileNotFoundError: [Errno 2] No such file or directory: './DATA_gen3/export.yaml'
I have soooo many gen3 butler notes and documents that I’m not sure I’m getting any value here. I am jumping from one ad-hoc fix-it-up step to another. I don’t mind this, because I can learn a lot from it.
After the convert, I see these directories/files in my GEN3_run directory:
-rw-r--r-- 1 fredklich staff 756 Aug 5 11:15 butler.yaml
drwxr-xr-x 4 fredklich staff 128 Aug 5 11:17 HSC
drwxr-xr-x 3 fredklich staff 96 Aug 5 11:22 skymaps
drwxr-xr-x 3 fredklich staff 96 Aug 5 11:22 refcats
-rw-r--r-- 1 fredklich staff 1061085184 Aug 5 11:22 gen3.sqlite3
…even the refcats directory has the Pan-STARRS files, making me wonder if I still need to do an import step to import the refcats.
Does this look correct to you?
I’m happy to scratch and start over, hoping that I can follow a chronology that goes more smoothly. Should I wait until gen3 is more stable?
Many thanks again. Fred, Dallas, TX
PS - I’m very grateful for your attempts to help me.

butler import requires that you have exported datasets from some other place. butler convert converts gen2 to gen3, and all the files are now in a gen3 butler. Running butler query-collections on that converted repo will tell you what collections you have, and butler query-datasets will show you the datasets in one of those collections.
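
The same inspection can be done from Python in a notebook; a sketch, assuming the converted repo is GEN3_run:

from lsst.daf.butler import Butler

butler = Butler("GEN3_run")
# list every collection in the converted repo
for name in butler.registry.queryCollections():
    print(name)
# list the datasets in one collection (... means "any dataset type")
for ref in butler.registry.queryDatasets(..., collections="HSC/raw/all"):
    print(ref.datasetType.name, ref.dataId)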

okay, Tim, I’ll take a look.
Thanks

Well, it’s encouraging. Both queries show real/substantial data:
butler query-collections ./GEN3_run
Name Type Definition
HSC/raw/all RUN
HSC/calib CALIBRATION
HSC/calib/unbounded RUN
HSC/calib/curated/19700101T000000Z RUN
HSC/calib/curated/20130131T000000Z RUN
HSC/calib/curated/20140403T000000Z RUN
HSC/calib/curated/20140601T000000Z RUN
HSC/calib/curated/20151106T000000Z RUN
HSC/calib/curated/20160401T000000Z RUN
HSC/calib/curated/20161122T000000Z RUN
HSC/calib/curated/20161223T000000Z RUN
skymaps RUN
refcats/gen2 RUN
HSC/calib/20130617T000000Z RUN
HSC/calib/20131103T000000Z RUN
HSC/calib/20141112T000000Z RUN
HSC/calib/20140714T000000Z RUN
refcats CHAINED [refcats/gen2]
HSC/defaults CHAINED
and the query-datasets shows a plethora of data.

Soooo, Tim, does the above suggest that I have a golden gen3 butler and don’t need to perform a butler import? Sorry, I sound like I’m going around in circles here…