Recreating the LSST Science Pipeline tutorial (gen 2) using only Generation 3 command line tasks and pipetasks

It all depends on what you are trying to do. That is a gen3 repository that has the data in it that was in your gen2 repo. If you are going through the gen2 tutorial that used that repo, then there are gen3 variants of those commands that will work. The gen3 equivalent tutorials don’t exist yet, so you might have to work things out from the other documentation for now. If you want to process some data, take a look at the pipelines_check repository – its bin/run_demo.sh takes you through the steps of reading in some data and running the processCcd pipeline on it. You should be able to modify those commands for your needs, or even just clone the pipelines_check repo and run the steps in that demo script yourself.
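For reference, a rough sketch of that route (the GitHub URL and the eups setup line are my assumptions; run_demo.sh is the script mentioned above):

git clone https://github.com/lsst/pipelines_check.git
cd pipelines_check
setup -k -r .        # assumes lsst_distrib is already set up via eups
./bin/run_demo.sh    # ingests the demo data and runs the processCcd pipeline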

Importing data from a butler export is also fine but you’d need a repo to export from.

To recap: what I was trying to do was to attach a proper reference catalog to my gen3 pipeline. So, while I believe the conversion of my gen2 butler to a gen3 butler is good to go, I’m still not clear on where my reference catalog is. I have a /refcat/ directory with .fits content, but (again) I’m not sure whether it’s usable.
So, okay, I will follow your suggestions. I’ve learned a lot, thanks to you and Joshua.
I plan to attend the CW all next week, then go back to it the week after, while watching for the gen3 tutorials to evolve. Plus, I may fill in more pieces from listening diligently next week.
Many thanks, Tim.

We have found that in gen 3 the ref cats converted from a gen 2 repo do seem to ingest with

butler import

along with an export.yaml file. This was originally failing, but @joshuakitenge tells me it can be fixed by specifying the refcats as an input, e.g.:

pipetask run -b GEN3_run/ --input HSC/raw/all,refcats --register-dataset-types \
    -p "${PIPE_TASKS_DIR}/pipelines/DRP.yaml#processCcd" \
    --instrument lsst.obs.subaru.HyperSuprimeCam --output-run demo_collection \
    -c isr:doBias=False -c isr:doBrighterFatter=False -c isr:doDark=False \
    -c isr:doFlat=False -c isr:doDefect=False

Hi again, I have finally managed to do a whole run of the LSST Science Pipeline tutorial using only gen3 command line tasks and pipetasks. The hacks I used to get this to work are described in the document below.

Gen3 run-w_2021_32.html (1.4 MB)


@joshuakitenge I’m terribly sorry and can’t apologize enough for not getting back to you sooner. I started preparing something for you and then was out of town for a long weekend and it slipped my mind by the time I got back. I am working up a write up based on your html file that I will share here for everyone, hopefully today.

Q: Are the configs correct?

A: Correct in what sense? The configs can be whatever you want them to be for your processing. If you mean are these the configs that are normally run with HSC data, then the answer is no.

To expand on this a bit further, I would encourage you to use the DRP pipeline that is specialized for HSC processing rather than the generic pipe_tasks version. That can be found at ${OBS_SUBARU_DIR}/pipelines/DRP.yaml and you can restrict the processing with the same #processCcd labeled subset.

This pipeline takes the generic pipeline as an import and then customizes it for HSC processing. While it is possible to apply all the same config changes yourself, it becomes quite daunting and makes your command line unmanageable.

Additionally, there is no need to specify the instrument on the command line when using the obs_subaru version, as that is defined as part of the pipeline. Specifying the instrument lets the system know to apply any instrument-specific overrides to tasks automatically. So under normal circumstances, the only configs you need to specify are the things you want to be different from standard processing.
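Putting those two points together, a sketch of what the earlier command might look like against the obs_subaru pipeline (the repo path, input collections, and output run are carried over from the command above and are assumptions about your setup):

pipetask run -b GEN3_run/ --input HSC/raw/all,refcats --register-dataset-types \
    -p "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd" \
    --output-run demo_collection

The --instrument argument is dropped because the pipeline defines it; the isr flags from the earlier command can still be passed with -c if you want that same simplified ISR.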

You can specify those config differences on the command line as you have, or you can create your own pipeline that imports the obs_subaru pipeline and adds customizations to it, which makes it easier to version control and share. If you would like to learn more about the pipeline system and how it works, you can read about it at
https://pipelines.lsst.io/v/weekly/modules/lsst.pipe.base/creating-a-pipeline.html
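As a rough illustration of that approach (the exact YAML schema is described at the link above; the task label, class, and config value here are placeholders to show the shape, not recommended settings):

description: HSC DRP with my customizations
imports:
  - location: $OBS_SUBARU_DIR/pipelines/DRP.yaml
tasks:
  isr:
    class: lsst.ip.isr.IsrTask
    config:
      doBias: false

You would then point -p at this file (with the same #subset syntax) and keep it under version control.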

I would caution that unlike gen2, where things were very rigid in terms of running single frame, coaddition, multiband, etc., gen3 is much more flexible and diverse. In general, “single frame processing” is not just processCcd anymore. The subset name is kept as a stand-in for running the three tasks that were run as part of that in gen2. Gen3 execution is based around graphs of datasets, and as such new tasks may be added quite easily. For instance, “single frame processing” now encompasses things like grouping exposures, making visit-level tables of all the individual exposures that were processed, and creating and transforming source tables into parquet tables. These will be used in downstream processing. This holds for the groupings below as well. In fact, tasks can be grouped in more than one way, as subsets are just an alias that means “run these tasks”. You may do well to run the whole pipeline end to end. As we are transitioning to gen3, there are a few steps that can’t be run end to end in some cases (like running FGCM), or it may be difficult for machines or people to hold it all in their heads. The obs_subaru pipeline defines subsets called step1, step2, etc. that can be run.

That of course is if you want to run all the tasks that have been created up to this point and will be used in normal processing. If you want to stick to just what the old demo did, you have run the correct subsets. However, they can still be run end to end by specifying your pipeline as DRP.yaml#processCcd,coaddition,multiband. This way you do not need intermediate collections, etc.; just one command runs all these tasks.
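Concretely, under the same assumptions as the sketch above (and assuming the skymap has been registered, with its collection, conventionally named skymaps, added to the inputs), that single command could look like:

pipetask run -b GEN3_run/ --input HSC/raw/all,refcats,skymaps --register-dataset-types \
    -p "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd,coaddition,multiband" \
    --output-run demo_collection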

S: Before you can run the coadditions pipetask you have to run the make-discrete-skymap command line task

A: That command should actually be done prior to doing any processing. In general this would already have been done for any standard butler you connect to.
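For reference, a hedged sketch of that command (repo, instrument, and collection names are taken from the commands above and may need adjusting; check butler make-discrete-skymap --help for the exact arguments):

butler make-discrete-skymap --collections demo_collection GEN3_run/ HSC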

S: coadditions pipetask

A: Same notes as above with relation to instrument, pipeline to run and config values.

S: The “assembleCoadd:doMaskBrightObjects=False” wasn’t needed when I ran this test before.

A: Not sure about this one; the setting turns off applying bright object masks during coaddition. It will be needed if you do not have any bright object masks ingested into your butler. Turning this off removes the dataset type from the set the task attempts to load, and you will not need your temporary fix. In general, that requirement is there to ensure that if the task author intended data to be present and it isn’t, the task will not run, rather than producing an error later on. It should be left as is (notwithstanding your experimentation, of course). Turning off doMaskBrightObjects is the task author’s way of letting you NOT have masks and not apply them.

However, I know that you disabled it because there was an issue with a downstream task. This is caused by the MeasureMergedCoaddSourcesTask default configuration, which has the BRIGHT_OBJECT mask plane in it. That mask plane will only be added if you do in fact mask bright objects. This can be altered in configuration by making sure “BRIGHT_OBJECT” is not in measure:measurement.plugins["base_PixelFlags"].masksFpCenter or in measure:measurement.plugins["base_PixelFlags"].masksFpAnywhere. You can have either a config file or a pipeline that removes BRIGHT_OBJECT from each of those (they should be able to be manipulated like a normal python list).
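A sketch of what that override could look like as a config file passed to pipetask with -C measure:noBrightObjectPlane.py (the file name is arbitrary; the config paths are the ones given above):

# Drop BRIGHT_OBJECT from the pixel-flag mask plane lists of the measure task
# (MeasureMergedCoaddSourcesTask) so the plane is not required when bright
# object masking is disabled upstream.
for name in ("masksFpCenter", "masksFpAnywhere"):
    planes = getattr(config.measurement.plugins["base_PixelFlags"], name)
    setattr(config.measurement.plugins["base_PixelFlags"], name,
            [p for p in planes if p != "BRIGHT_OBJECT"])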

I agree this is not obvious, and difficult to track down if you do not work with these tasks often. Pipelines have a feature called contracts that lets us do some cross-task config validation before anything runs and print out a useful message if one fails. I have created a ticket to add a contract that validates these config values so that in the future no one else is bitten by this.

Q: Potential issue 1

A: That is not actually an issue, but a fork in the code’s execution path depending on whether it is able to do so. This message should not be printed out, and certainly not as many times as it is. There is already a ticket to fix this issue.

Q: Potential issue 2

A: This absolutely should not happen, and we have no issue in our other processing using makeSource. I am not sure the exposureIdInfo will do the right thing in all cases. At the moment I am going to need to look into this more, as the answer is not obvious to me. I do know it is likely from how you set up your butler initially, as that is where those expBits come from. I suspect you need to specify the collections for define-visits to look at when running that command with the --collections argument, which I think is going to be something like DATA_gen3/HSC/raw/, but you can look at butler query-collections to be sure.
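If that turns out to be the case, a hedged example of what it might look like (the collection name is a guess based on the standard raw collection naming):

butler define-visits GEN3_run/ HSC --collections HSC/raw/all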

Notes:

I would STRONGLY, STRONGLY recommend against looking at the file system paths of the outputs from processing. This is abstracted to support many different back ends, where no file system may even be present. It also guards against changes to these paths if the datastore changes where it decides to put the files. Please get used to interacting with data through the butler, with either the python or the command line API. If you need the location of a dataset to supply it to some other program, use the getURI method through the butler. This gives you a ButlerURI object from which you can get a path, or which you can use as a file handle to load/save the object. This will be extremely important when there are, for instance, S3-backed datastores.
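A minimal sketch of that pattern (the data ID values and collection name here are placeholders; use ones that exist in your repo):

from lsst.daf.butler import Butler

butler = Butler("GEN3_run", collections="demo_collection")
# getURI returns a ButlerURI rather than a raw file path
uri = butler.getURI("calexp", instrument="HSC", visit=903334, detector=16)
print(uri)  # for a file-backed datastore, uri.ospath gives a local path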

This is also important because datasets are not necessarily organized as you might expect from gen2. Data products are associated in things called collections, which are roughly akin to reruns in gen2. However, unlike gen2 there are no links between collections (since there is no file system); instead, the information on how collections relate is all part of the butler. This means that if you load up a butler and say “list all the data in this collection”, and it is associated with other collections (whatever was an input when processing that data, for instance), you will get a large list. When you look at that same “collection” on disk, not all of that data will be within the file structure. There is a lot more to collections, but at this point I mainly intend this as a user-beware prompt.
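On the command line, the analogous way to explore this (rather than poking at the file system) is something like:

butler query-collections GEN3_run/
butler query-datasets GEN3_run/ --collections demo_collection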

If I missed anything I’m sorry just ping me and I will address it.


As a follow up, more info on collections can be found here: Organizing and identifying datasets — LSST Science Pipelines


Cheers for this, this was extremely useful :slightly_smiling_face:.

Doesn’t the processCcd step need to be done first, as the make-discrete-skymap command needs the calexps to create the skymap? Well, in my instance it didn’t allow me to run the command until the processCcd step was completed.

Hello Joshua, thanks for the update. I’ve returned after attending the PCW virtually.
I am now planning to repeat my attempt at running the Gen3 pipeline. I have a few questions:
1. I will upgrade from weekly 31 to weekly 32 with: eups distrib install -t w_2021_32 lsst_distrib . Correct?
2. I wish to ask if you can confirm the latest Jupyter notebook I should use to repeat the Gen3 pipeline.
3. I hope to do better in the “hack” step to get my ref catalog established. Maybe I understand it better.
Please offer any comments you may have.
Many thanks, Fred, Dallas, Tx

Hi

You are going to need to download the week 32 newinstall script (curl -OL https://raw.githubusercontent.com/lsst/lsst/w.2021.32/scripts/newinstall.sh) before updating to week 32 ( eups distrib install -t w_2021_32 lsst_distrib).

This version is the latest version, and I’ve attached the .ipynb version below as well.

Gen3 run-w_2021_32.ipynb (882.6 KB)

I’m going to be uploading a newer version in the next couple of days, encompassing the knowledge that I’ve gained from this post below.

Okay, Joshua, thanks. I am examining version 32 of the Jupyter notebook, hoping the “hack” steps confirm my expectations of the detailed steps needed to get through them successfully. So, at this time, I may hit the “pause” button in anticipation of your newer version.
Since I’m still a newbie, a tutorial that runs clean helps me a lot, while if it has a few hiccups, THAT also helps me learn. So, again, thanks for your help.
I’ll watch for your updates.
Fred, Dallas, Tx
PS - are you up in the Glasgow area?

You shouldn’t need to re-install the pipeline from scratch in order to install a new version.

thank you, I did understand that I would not have to re-install the full pipeline…

Oh, you are entirely correct; I was not reading closely enough and didn’t process the “discrete” part, only internalizing the “make a skymap” part. Yes, you need calexps for that.

You may consider using the register-skymap command with a config file so you can define the skymap up front and run the pipeline end to end. There is not much of a difference for this particular case, but it can be nice to run things in a similar manner to when you are using an existing butler. For instance, you would not re-run the discrete command on any subsequent reprocessings of the same data.

Having the skymap defined ahead of time allows you to do things such as saying “run this pipeline for tract 10, patch 20”. It will figure out which raws are needed to complete that request and generate a restricted execution graph. Like I said, this won’t make much of a difference for what you are running, but if you took the same commands written above and attempted to run them on a butler with HSC DR2, they would attempt to process ALL the raws they found (which may indeed be what you want).
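A hedged sketch of registering a skymap up front with a config file (the file name is arbitrary, and its contents depend on which skymap you choose; the config example linked further down is the thing to copy from):

butler register-skymap GEN3_run/ -C skymap_config.py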

So it really depends on what you want to get out of what you are doing now. Is it something for people doing quick processing on their laptops, or something to get familiar with commands they can then use elsewhere? If the latter, it would be helpful to have things split up into a “here is how to set up a mock butler” step and a “process some data” step.

If you are interested in a skymap config file, you can look at the one we use here, and of course continue to ask questions and we can help.


It is safest to always use the appropriate newinstall.sh, but in many cases that file has not changed, and no new installation is required. Messages will be posted on Community (e.g. Deploying rubin-env 0.7.0) when a new version appears.


Okay, Joshua, I’m back to where I was 11 days ago. I’m at the tutorial “Hack” step again:
Importing the refcats into the butler repository:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

  • I used a hack to get the refcats into this butler repository
    • I used the butler convert command line task to convert the gen 2 repository (LSST Science pipeline tutorial) to a gen 3 repository
    • Then I used the butler import command line task to import into the new butler repository (GEN3_run)
  • Question - Is there a way of ingesting the refcats in a gen 3 format?
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    I have successfully completed the butler convert command. However, while I still do not understand the need to perform a butler import, the import command fails because it cannot find the .yaml file.
  1. If I’ve already converted the DB to a gen3 version, do I have to do a fresh export so that I can THEN do an import to my Gen3 system?
  2. If I DO have to perform a fresh export, do I have to export everything: raws, calibrations, reference catalog, etc.?
  3. If another developer really owns these answers, I appreciate any/all help. Perhaps I need to stand back until the tutorial can work.

This is all being done by @parejkoj (see my comment above).

The current plan is to rewrite refcat creation such that it is independent of butler. It will create the files as normal and also create an index file. Then we would just run butler ingest-files on the converted files with that index.

There is also an open ticket for writing a little helper function that will just do the refcat part of gen2to3 conversion and bypass the export/import step.

If you are going down the route of converting a gen2 repo to gen3 solely for the refcat (and then abandoning it), you would have to export the refcats from this new gen3 repo to get the export yaml file that you would then use for the later import into the new repo.

You could, though, use the converted repo itself as a starting point. The tutorial steps should still work, since you can still import the data you want for it and register any new instruments (if needed). Gen3 can support multiple instruments (unlike gen2) and also has a more flexible collection management system.

Okay, Tim, thanks for the affirmation. It tells me my thinking is on the right track.

  1. I have done the convert from gen2 to gen3.
  2. I have to perform an export of the refcats [this will create an exports.yaml file].
  3. I then have to perform an import [referencing the exports.yaml created in step 2].
  4. Then continue with ingesting, defining visits, etc., and other steps.
    Hopefully, somehow, I’m staying on the right track here.
    Can you advise the CLI for the export step? Or should I attempt something like this in python:
    export.saveDatasets(butler.registry.queryDatasets(collections='refcats',
                                                      datasetType='ps1_pv3_3pi_20170110'))
    Many thanks,
    Fred, Dallas, Tx

Yes. There is no butler export command at the moment because the user interface can be very general.

See though something like:

and you can see that you are pretty close. There is the with butler.export context manager and then the saveDatasets call – saving the refcats collection might also be a good idea if you haven’t made it already.
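For reference, a sketch of that export, assuming the converted gen2-to-gen3 repo is called DATA_gen3 and keeping the dataset type name from your snippet (adjust paths and names to your setup):

from lsst.daf.butler import Butler

butler = Butler("DATA_gen3")  # the repo produced by butler convert
with butler.export(filename="export.yaml") as export:
    export.saveDatasets(butler.registry.queryDatasets("ps1_pv3_3pi_20170110",
                                                      collections="refcats"))
    export.saveCollection("refcats")  # record the refcats collection itself

The resulting export.yaml is then what the later butler import into GEN3_run points at.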