Q: Are the configs correct?
A: Correct in what sense? The configs can be whatever you want them to be for your processing. If you mean are these the configs that are normally run with HSC data, then the answer is no.
To expand on this a bit further, I would encourage you to use the DRP pipeline that is specialized for HSC processing rather than the generic pipe_tasks version. It can be found at ${OBS_SUBARU_DIR}/pipelines/DRP.yaml, and you can restrict the processing with the same #processCcd labeled subset.
This pipeline takes the generic pipeline as an import and then customizes it for HSC processing. While it is possible to apply all the same config changes yourself, it becomes quite daunting and makes your command line unmanageable.
Additionally, there is no need to specify the instrument on the command line when using the obs_subaru version, as that is defined as part of the pipeline. By specifying the instrument, you are letting the system know to apply any instrument-specific overrides to tasks automatically. So under normal circumstances, the only configs you need to specify are things you want to be different from the standard processing.
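For concreteness, here is a rough sketch of what such an invocation might look like; the repo path, input collections, output name, and data query are placeholders rather than recommendations, and any config overrides would be added with -c label:key=value:

```
# Hypothetical invocation; replace the repo, collections, output, and query with your own.
# No --instrument is needed, since the obs_subaru pipeline already declares it.
pipetask run \
    -b DATA_gen3 \
    -p "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd" \
    -i HSC/raw/all,HSC/calib \
    -o u/me/singleFrame \
    -d "instrument = 'HSC' AND exposure = 903334"
```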
You can specify those config differences on the command line as you have, or you can create your own pipeline which imports the obs_subaru pipeline and adds your customizations to that, which would make it easier to version control and share. If you would like to learn more about the pipeline system and how it works, you can read about it at https://pipelines.lsst.io/v/weekly/modules/lsst.pipe.base/creating-a-pipeline.html
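As a rough illustration of that approach, a personal pipeline could look something like the sketch below; the file name, the imported location key, and the particular override are assumptions for illustration only, so check the linked documentation for the exact syntax your stack version expects:

```yaml
# my_hsc_drp.yaml -- hypothetical personal pipeline
description: HSC DRP with my local customizations
imports:
  - location: $OBS_SUBARU_DIR/pipelines/DRP.yaml
tasks:
  calibrate:
    class: lsst.pipe.tasks.calibrate.CalibrateTask
    config:
      doPhotoCal: false  # example override only; use whatever you actually need
```

You would then pass this file to pipetask with -p instead of pointing at the obs_subaru DRP.yaml directly.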
I would caution that, unlike gen2, where things were very rigid in terms of running single frame, coaddition, multiband, etc., gen3 is much more flexible and diverse. In general, “single frame processing” is not just processCcd anymore; the subset name is kept as a stand-in for running the three tasks that were run as part of it in gen2. Gen3 execution is based around graphs of datasets, and as such new tasks may be added quite easily. For instance, “single frame processing” now encompasses things like making visit-level tables that group all of the individual exposures that were processed, and creating and transforming source tables into parquet tables; these will be used in downstream processing. This holds for the groupings below as well. In fact, tasks can be grouped in more than one way, as subsets are just an alias that means “run these tasks”.

You may do well to run the whole pipeline end to end. As we are transitioning to gen3 there are a few steps that cannot be run end to end in some cases (like running FGCM), or it may be difficult for machines or people to hold it all in their head, so the obs_subaru pipeline defines subsets called step1, step2, etc. that can be run.
That, of course, is if you want to run all the tasks that have been created up to this point and will be used in normal processing. If you want to stick to just what the old demo did, you have run the correct subsets. However, they can still be run end to end by specifying your pipeline as DRP.yaml#processCcd,coaddition,multiband. This way you do not need intermediate collections etc., just one command to run all these tasks.
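In sketch form (placeholder repo, collections, and output again), that one command might look like:

```
# Hypothetical end-to-end run of the three subsets from the old demo; names are placeholders.
pipetask run \
    -b DATA_gen3 \
    -p "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd,coaddition,multiband" \
    -i HSC/raw/all,HSC/calib,skymaps \
    -o u/me/demoRun \
    -d "instrument = 'HSC'"
```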
S: Before you can run the coadditions pipe task you have to run the make-discrete-skymap command line task
A: That command should actually be run prior to doing any processing. In general it would already have been done for any standard butler repository you connect to.
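If you do need to run it yourself, a hedged sketch of the command is below; the repo path, instrument, and collection name are placeholders, and the exact argument order may differ by stack version, so check butler make-discrete-skymap --help:

```
# Hypothetical sketch; the collections given should contain the images the skymap is built around.
butler make-discrete-skymap --collections HSC/raw/all DATA_gen3 HSC
```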
S: coadditions pipetask
A: Same notes as above regarding the instrument, the pipeline to run, and config values.
S: The “assembleCoadd:doMaskBrightObjects=False” wasn’t needed when I ran this test before.
A: Not sure about this one; the setting turns off applying bright object masks during coaddition. This will be needed if you do not have any bright object masks ingested into your butler. Turning it off removes the dataset type from the set the task attempts to load, and you will not need your temporary fix. In general that check is there to ensure that if the task author intended data to be present and it is not, the task will not run, rather than you finding out via an error later on. It should be left as is (notwithstanding your experimentation, of course). Turning off doBrightObjectMask is the task author’s way of letting you NOT have masks and not apply them.
However, I know that you disabled it because there was an issue with a downstream task. This is caused by the MeasureMergedCoaddSourcesTask default configuration, which has the BRIGHT_OBJECT mask plane in it. This mask plane will only be added if you do in fact mask bright objects. This can be altered in configuration by making sure “BRIGHT_OBJECT” is not in measure:measurement.plugins["base_PixelFlags"].masksFpCenter or measure:measurement.plugins["base_PixelFlags"].masksFpAnywhere. You can have either a config file or a pipeline that removes BRIGHT_OBJECT from each of those (they should be able to be manipulated like a normal python list).
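As a sketch of the config-file route (the file name is arbitrary, and this assumes those list fields really can be manipulated like normal Python lists, as noted above):

```python
# Hypothetical config override file for the "measure" task (MeasureMergedCoaddSourcesTask),
# applied via pipetask's per-task config-file option or from your own pipeline.
plugins = config.measurement.plugins["base_PixelFlags"]
plugins.masksFpCenter = [m for m in plugins.masksFpCenter if m != "BRIGHT_OBJECT"]
plugins.masksFpAnywhere = [m for m in plugins.masksFpAnywhere if m != "BRIGHT_OBJECT"]
```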
I agree this is not obvious, and it is difficult to track down if you do not work with these tasks often. Pipelines have a feature called contracts that allows us to do some cross-task config validation prior to anything running, and to print out a useful message if one fails. I have created a ticket to add a contract that validates these config values so that in the future no one else is bitten by this.
Q: Potential issue 1
A: That is not actually an issue, but a message from a code path that forks execution depending on what it is able to do. The message should not be printed out, and certainly not as many times as it is. There is already a ticket to fix this.
Q: Potential issue 2
A: This absolutely should not happen, and we have no issue in our other processing using makeSource. I am not sure the exposureIdInfo will do the right thing in all cases. However, at the moment I am going to need to look into this more, as the answer is not obvious to me. I do know it is likely from how you set up your butler initially, as that is where those expBits come from. I suspect you need to specify the collections for define-visits to look at when running that command with the --collections argument, which I think is going to be something like DATA_gen3/HSC/raw/, but you can look at butler query-collections to be sure.
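A hedged sketch of that check and the follow-up command (the repo path and collection name are placeholders; use whatever query-collections actually shows, and confirm the argument order with --help):

```
# List the collections the registry knows about, then point define-visits at the raw collection.
butler query-collections DATA_gen3
butler define-visits DATA_gen3 HSC --collections HSC/raw/all
```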
Notes:
I would STRONGLY STRONGLY recommend against looking at the file system paths of the outputs from processing. This is abstracted to support many different back ends, where no file system may even be present. Additionally, it guards against changes to these paths if the datastore changes where it decides to put the files. Please get used to interacting with data through the butler with the Python or command-line API. If you need to get the location of a dataset to supply it to some other program, use the getURI method on the butler. This will give you a ButlerURI object from which you can get a path, or which you can use as a file handle to load/save the object. This will be extremely important when, for instance, the datastore is S3 backed.
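A minimal sketch of that pattern (the repo path, collection, and dataId are placeholders):

```python
from lsst.daf.butler import Butler

# Hypothetical repo, collection, and dataId.
butler = Butler("DATA_gen3", collections=["u/me/singleFrame"])
uri = butler.getURI("calexp", instrument="HSC", visit=903334, detector=42)
print(uri)               # a ButlerURI; valid whether the datastore is file, S3, or something else
local_path = uri.ospath  # only meaningful when the datastore is file-backed
```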
This is also important because datasets are not necessarily organized as you might expect from gen2. Data products are associated into things called collections, which are roughly akin to reruns in gen2. However, unlike gen2, there are no links between collections (since there may be no file system); instead, the information on the relations between collections is all part of the butler. This means that if you load up a butler and say “list all the data in this collection”, and that collection is associated with other collections (whatever was an input when processing that data, for instance), it will be a large list. When you look at that same “collection” on disk, not all of the same data will be within the file structure. There is a lot more to collections, but at this point I hope this serves as a user-beware prompt.
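In other words, ask the butler rather than the file system. A minimal sketch (placeholder names again):

```python
from lsst.daf.butler import Butler

butler = Butler("DATA_gen3")  # hypothetical repo
# Everything the registry associates with this collection, including inputs chained in from
# other collections, shows up here even though it may live elsewhere on disk (or not on disk at all).
for ref in butler.registry.queryDatasets("calexp", collections="u/me/singleFrame"):
    print(ref.dataId)
```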
If I missed anything, I’m sorry; just ping me and I will address it.