Recreating the LSST Science Pipelines tutorial (gen 2) using only Generation 3 command-line tasks and pipetasks

Good deal Tim. I’m going to work with notebook 4, as I enjoy using jupyter notebooks, since I can run a few lines, look at some objects, then repeat.
Thanks, Fred, Dallas tx

Basic question. For this tutorial ( see below), is there a place where I can snag this as a .ipynb jupyter notebook?

Gen3run-V22.html

here it is
Gen3 run-v22.ipynb (680.7 KB)


Today I merged a fix for some of the slow down with composite datasets and S3. Turns out that we weren’t caching the boto3 client and so every single get was taking far longer than it should have done. Should be in the weekly coming out tonight.
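For anyone curious, the general shape of that kind of fix is just to construct the S3 client once and reuse it. This is not the actual daf_butler datastore code (the helper names here are made up), just a minimal sketch of the pattern:

    import functools

    import boto3


    @functools.lru_cache()
    def _cached_s3_client():
        # Client construction is relatively expensive (credential lookup,
        # endpoint resolution), so build it once and reuse it for the life
        # of the process.
        return boto3.client("s3")


    def get_object_bytes(bucket: str, key: str) -> bytes:
        """Fetch an object using the shared, cached client (hypothetical helper)."""
        response = _cached_s3_client().get_object(Bucket=bucket, Key=key)
        return response["Body"].read()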


I'm continuing with my gen3 execution of the pipeline processes.
I'm following the steps in Gen3run-v22.html. I've successfully registered the HSC instrument:
butler register-instrument GEN3_run/ lsst.obs.subaru.HyperSuprimeCam
Now I'm attempting the butler import, which fails with many errors:
butler import GEN3_run/ ~/lsst_stack/DATA_gen3/ --export-file exports.yaml
I believe I may need to "hack" a convert from the gen2 to the gen3 repository. I have reincarnated the gen2 repository per the original tutorial.
I need an example of the butler convert command. I've fabricated this:
butler convert --gen2root /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc /Users/fredklich/Downloads/lsst_stack/GEN3_run
I'm not clear on these points:
Do we need to reincarnate JUST the reference catalogs from gen2, or both the gen2 HSC repository data and the reference catalogs?
Can you provide an example of the above butler convert command?
Thanks, Fred, Dallas, TX

You can find how to use the butler convert command in this document

thanks Joshua, I DID look there, but did not realize that some of the commands actually do have EXAMPLES. Great.

Okay, Joshua, I did make significant progress. Running the convert, after 5 minutes or so (seemingly at the very end), I get these messages:
convertRepo INFO: Defining HSC/defaults from chain ['HSC/raw/all', 'refcats', 'HSC/calib', 'skymaps', 'HSC/masks'].
Ends with this:
lsst.daf.butler.registry._exceptions.MissingCollectionError: No collection with name 'HSC/masks' found.
My command is:
butler convert --gen2root /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc/DATA --calibs /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc/DATA/CALIB /Users/fredklich/Downloads/lsst_stack/GEN3_run

I’m doing my best to fill in some of the gaps. Not sure I can discern a way to continue on with gen3.

You should still be able to export and import the refcats from the converted repo you just created. The guide to do this is in the tutorial that I posted before, so you can continue with the workflow from Gen3run-v22. The fact that HSC/masks wasn't found shouldn't affect the refcats.
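If it helps, here is a rough sketch of what that export step can look like through the Python API, assuming the converted repo from above. The refcat dataset type name ("ps1_pv3_3pi_20170110") and the collection names are assumptions; check butler query-dataset-types and butler query-collections on your repo first.

    from lsst.daf.butler import Butler

    butler = Butler("GEN3_run")
    with butler.export(filename="exports.yaml") as export:
        # Record the refcat datasets themselves...
        export.saveDatasets(
            butler.registry.queryDatasets("ps1_pv3_3pi_20170110", collections="refcats/gen2")
        )
        # ...and the collections that organize them.
        export.saveCollection("refcats/gen2")
        export.saveCollection("refcats")

The resulting exports.yaml (plus the converted repo as the source directory) is what butler import expects to be pointed at.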

thanks Joshua. If I assume that my convert is good to go, I have attempted to move forward [FINALLY] with the butler import +++++ command. This fails with FileNotFoundError: [Errno 2] No such file or directory: './DATA_gen3/export.yaml'
I have soooo many gen3 butler notes and documents that I'm not sure I'm getting any value here. I am jumping from one ad-hoc fix-up step to another. I don't mind this, because I can learn a lot from it.
After the convert, I see these directories/files in my GEN3_run directory:
-rw-r--r--  1 fredklich  staff         756 Aug  5 11:15 butler.yaml
drwxr-xr-x  4 fredklich  staff         128 Aug  5 11:17 HSC
drwxr-xr-x  3 fredklich  staff          96 Aug  5 11:22 skymaps
drwxr-xr-x  3 fredklich  staff          96 Aug  5 11:22 refcats
-rw-r--r--  1 fredklich  staff  1061085184 Aug  5 11:22 gen3.sqlite3
…even the refcats directory has the Pan-STARRS files, making me wonder if I still need to do an import step to import the refcats.
Does this look correct to you?
I’m happy to scratch and start over, hoping that I can follow a chronology that goes more smoothly. Should I wait until gen3 is more stable?
Many thanks, again. Fred, Dallas, tx
PS - I’m very grateful for your attempts to help me.

butler import requires that you have exported datasets from some other place. butler convert converts gen2 to gen3, and all the files are now in a gen3 butler. butler query-collections on that converted repo will tell you what collections you have, as will butler query-datasets on one of those collections.
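The same checks can also be done from Python if you prefer a notebook; the dataset type name below is just an example, so substitute whatever your repo actually contains:

    from lsst.daf.butler import Butler

    butler = Butler("GEN3_run")

    # Equivalent of `butler query-collections GEN3_run`
    for name in butler.registry.queryCollections():
        print(name)

    # Equivalent of `butler query-datasets`, restricted to one collection
    for ref in butler.registry.queryDatasets("raw", collections="HSC/raw/all"):
        print(ref.datasetType.name, ref.dataId)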

okay, Tim, I’ll take a look.
Thanks

Well, it's encouraging. Both queries show real/substantial data:
butler query-collections ./GEN3_run
Name Type Definition
HSC/raw/all RUN
HSC/calib CALIBRATION
HSC/calib/unbounded RUN
HSC/calib/curated/19700101T000000Z RUN
HSC/calib/curated/20130131T000000Z RUN
HSC/calib/curated/20140403T000000Z RUN
HSC/calib/curated/20140601T000000Z RUN
HSC/calib/curated/20151106T000000Z RUN
HSC/calib/curated/20160401T000000Z RUN
HSC/calib/curated/20161122T000000Z RUN
HSC/calib/curated/20161223T000000Z RUN
skymaps RUN
refcats/gen2 RUN
HSC/calib/20130617T000000Z RUN
HSC/calib/20131103T000000Z RUN
HSC/calib/20141112T000000Z RUN
HSC/calib/20140714T000000Z RUN
refcats CHAINED [refcats/gen2]
HSC/defaults CHAINED
and the query-datasets shows a plethora of data.

Soooo, Tim, does the above suggest that I have a golden gen3 butler and don’t need to perform a butler import? Sorry, I sound like I’m going around in circles here…

It all depends on what you are trying to do. That is a gen3 repository that has the data in it that was in your gen2 repo. If you are going through the gen2 tutorial that used that repo, then there are gen3 variants of those commands that will work. The gen3 equivalent tutorials don't exist yet, so you might have to work things out from the other documentation for now. If you want to process some data, take a look at the pipelines_check repository – the bin/run_demo.sh script takes you through the steps of reading in some data and running the processCcd pipeline on it. You should be able to modify those commands for your needs, or even just clone the pipelines_check repo and run the steps in that demo script yourself.

Importing data from a butler export is also fine but you’d need a repo to export from.

To recap: what I was trying to do was attach a proper reference catalog for my gen3 pipeline. So, while I believe my convert of my gen2 butler to a gen3 butler seems good to go, I'm still not clear on where my reference catalog is. I have a /refcat/ directory with .fits content, but (again), I'm not sure if it's usable.
So, okay, I will follow your suggestions. I’ve learned a lot, thanks to you and Joshua.
I plan to attend the CW all next week, then go back to it the week after, while watching for the gen3 tutorials to evolve. Plus, I may fill in more pieces from listening diligently next week.
many thanks Tim.

We have found that in gen3 the refcats converted from a gen2 repo do ingest with

butler import

along with an export.yaml file. The subsequent pipetask run was originally failing, but @joshuakitenge tells me it can be fixed by specifying the refcats as an input, e.g.:

pipetask run -b GEN3_run/ --input HSC/raw/all,refcats --register-dataset-types \
    -p "${PIPE_TASKS_DIR}/pipelines/DRP.yaml#processCcd" \
    --instrument lsst.obs.subaru.HyperSuprimeCam --output-run demo_collection \
    -c isr:doBias=False -c isr:doBrighterFatter=False -c isr:doDark=False \
    -c isr:doFlat=False -c isr:doDefect=False

Hi again, I have finally managed to do a whole run of the LSST Science Pipelines tutorial using only gen3 command-line tasks and pipetasks. The hacks I used to get this to work are described in the document below.

Gen3 run-w_2021_32.html (1.4 MB)


@joshuakitenge I'm terribly sorry and can't apologize enough for not getting back to you sooner. I started preparing something for you, then was out of town for a long weekend, and it slipped my mind by the time I got back. I am working up a write-up based on your html file that I will share here for everyone, hopefully today.

Q: Are the configs correct?

A: Correct in what sense? The configs can be whatever you want them to be for your processing. If you mean are these the configs that are normally run with HSC data, then the answer is no.

To expand on this a bit further, I would encourage you to use the DRP pipeline that is specialized for HSC processing rather than the generic pipe_tasks version. That can be found at ${OBS_SUBARU_DIR}/pipelines/DRP.yaml and you can restrict the processing with the same #processCcd labeled subset.

This pipeline takes the generic pipeline as an import and then customizes it for HSC processing. While it is possible to apply all the same config changes yourself, it becomes quite daunting and makes your command line unmanageable.

Additionally, there is no need to specify the instrument on the command line when using the obs_subaru version, as it is defined as part of the pipeline. By specifying the instrument, you are letting the system know to apply any instrument-specific overrides to tasks automatically. So under normal circumstances, the only configs you need to specify are things you want to be different from standard processing.

You can specify those config differences on the command line as you have, or you can create your own pipeline which imports the obs_subaru pipeline and adds your customizations to it, which makes them easier to version control and share. If you would like to learn more about the pipeline system and how it works, you can read about it at
https://pipelines.lsst.io/v/weekly/modules/lsst.pipe.base/creating-a-pipeline.html

I would caution that unlike gen2, where things were very rigid in terms of running single frame, coaddition, multiband, etc., gen3 is much more flexible and diverse. In general "single frame processing" is not just processCcd anymore; the subset name is kept as a stand-in for running the three tasks that were run as part of it in gen2. Gen3 execution is based around graphs of datasets, and as such new tasks may be added quite easily. For instance, "single frame processing" now encompasses things like making visit-level tables of all the individual exposures that were processed, and creating and transforming source tables into Parquet tables, which are used in downstream processing. This holds for the groupings below as well. In fact, tasks can be grouped in more than one way, since subsets are just aliases that mean "run these tasks". You may do well to run the whole pipeline end to end. As we are transitioning to gen3, there are a few steps that can't be run end to end in some cases (like running FGCM), or it may be difficult for machines or people to hold it all in their heads. The obs_subaru pipeline defines subsets called step1, step2, etc. that can be run.

That, of course, is if you want to run all the tasks that have been created up to this point and will be used in normal processing. If you want to stick to just what the old demo did, you have run the correct subsets. However, they can still be run end to end by specifying your pipeline as DRP.yaml#processCcd,coaddition,multiband. This way you do not need intermediate collections, etc.; just one command runs all these tasks.

S: Before you can run the coadditions pipetask you have to run the make-discrete-skymap command-line task

A: That command should actually be run prior to doing any processing. In general this would already have been done for any standard butler you connect to.

S: coadditions pipetask

A: Same notes as above with relation to instrument, pipeline to run and config values.

S: The "assembleCoadd:doMaskBrightObjects=False" wasn't needed when I ran this test before.

A: Not sure about this one; the setting turns off applying bright object masks during coaddition. It will be needed if you do not have any bright object masks ingested into your butler. Turning this off removes the dataset type from the set the task attempts to load, and you will not need your temporary fix. In general that check is there to ensure that if the task author intended data to be present and it isn't, the task will not run, rather than producing an error later on. It should be left as is (notwithstanding your experimentation, of course). Turning off doMaskBrightObjects is the task author's way of letting you NOT have masks and not apply them.

However, I know that you disabled it because there was an issue with a downstream task. This is caused by the MeasureMergedCoaddSourcesTask default configuration, which has the BRIGHT_OBJECT mask plane in it. That mask plane will only be added if you do in fact mask bright objects. This can be altered in configuration by making sure "BRIGHT_OBJECT" is not in the measure task's config.measurement.plugins["base_PixelFlags"].masksFpCenter or config.measurement.plugins["base_PixelFlags"].masksFpAnywhere. You can use either a config file or a pipeline that removes BRIGHT_OBJECT from each of those (they should be able to be manipulated like a normal python list).
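For example, a config override file along these lines should do it (the file name is made up; it would be passed to pipetask with something like -C measure:no_bright_object.py):

    # no_bright_object.py -- config override file for the "measure" task.
    # `config` is the MeasureMergedCoaddSourcesTask config object supplied by
    # the config loader. Drop BRIGHT_OBJECT from the PixelFlags mask lists
    # when no bright-object masks have been ingested.
    flags = config.measurement.plugins["base_PixelFlags"]
    flags.masksFpCenter = [m for m in flags.masksFpCenter if m != "BRIGHT_OBJECT"]
    flags.masksFpAnywhere = [m for m in flags.masksFpAnywhere if m != "BRIGHT_OBJECT"]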

I agree this is not obvious and difficult to track down if you do not work with these tasks often. Pipelines have a feature called contracts that allows us to do some cross-task config validation prior to anything running, and print out a useful message if one fails. I have created a ticket to add a contract to validate these config values so that in the future no one else is bitten by this.

Q: Potential issue 1

A: That is not actually an issue, but a fork in the code execution path depending on what is possible. This message should not be printed out, and certainly not as many times as it is. There is already a ticket to fix this issue.

Q: Potential issue 2

A: This absolutely should not happen, and we have no issue in our other processing using makeSource. I am not sure the exposureIdInfo will do the right thing in all cases. However, at the moment I am going to need to look into this more, as the answer is not obvious to me. I do know it is likely from how you set up your butler initially, as that is where those expBits come from. I suspect you need to specify the collections for define-visits to look at when running that command with the --collections argument, which I think is going to be something like DATA_gen3/HSC/raw/, but you can look at butler query-collections to be sure.

Notes:

I would STRONGLY STRONGLY recommend against looking at the file system paths of the outputs from processing. This is abstracted to support many different back ends, where no file system may even be present. Additionally, it guards against changes to these paths if the datastore changes where it decides to put the files. Please get used to interacting with data through the butler with the python or command line api. If you need to get the location of a dataset to supply it to some other program, use the getURI method on the butler. This will give you a ButlerURI object from which you can get a path, or which you can use as a file handle to load/save the object. This will be extremely important when there are, for instance, S3-backed datastores.
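For instance, a minimal getURI call looks like this (the collection name and dataId values are placeholders; use ones that exist in your repo):

    from lsst.daf.butler import Butler

    butler = Butler("GEN3_run", collections="demo_collection")
    uri = butler.getURI("calexp", instrument="HSC", visit=903334, detector=16)
    print(uri)         # e.g. a file:// or s3:// URI, depending on the datastore
    print(uri.ospath)  # local filesystem path; only valid for file:// URIs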

This is also important because datasets are not necessarily organized as you might expect from gen2. Data products are associated in things called collections, which are roughly akin to reruns in gen2. However, unlike gen2 there are no links between collections (since there is no file system); instead, the information on the relations between collections is all part of the butler. This means that if you load up a butler and say "list all the data in this collection", and that collection is associated with other collections (whatever was an input when processing that data, for instance), you will get a large list. When you look at that same "collection" on disk, not all of the same data will be within the file structure. There is a lot more to collections, but at this point I hope this serves as a user-beware note.
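As a quick illustration of that chaining (using the collection names from the converted repo earlier in this thread):

    from lsst.daf.butler import Butler

    butler = Butler("GEN3_run")

    # The chain itself...
    print(list(butler.registry.queryCollections("refcats")))

    # ...and the RUN collections it resolves to.
    print(list(butler.registry.queryCollections("refcats", flattenChains=True)))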

If I missed anything I’m sorry just ping me and I will address it.


As a follow-up, more info on collections can be found here: Organizing and identifying datasets — LSST Science Pipelines
