doWriteHeavyFootprintsInSources appears not to have any effect

I’m using v19_0_0 of the pipelines and was hoping to limit the output file size of the “src” tables in cases with high source density using:

processCcd.py DATA --calib DATA/CALIB --rerun processCcdOutputs --id visit=725289 ccdnum=49 --longlog --config isr.doFringe=False --config calibrate.astrometry.matcher.maxRefObjects=3000 --config calibrate.doWriteHeavyFootprintsInSources=False &> processCcd.log &

However, the resulting src-0725289_49.fits file size with/without doWriteHeavyFootprintsInSources=False (and holding everything else fixed) ends up exactly the same (2+ GB). Checking DATA/rerun/processCcdOutputs/config/processCcd.py shows that this config setting does seem to be properly recorded:

grep doWriteHeavy DATA/rerun/processCcdOutputs/config/processCcd.py
config.calibrate.doWriteHeavyFootprintsInSources=False
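
For reference, I can confirm that the bloat really is coming from HeavyFootprints by reading the catalog back with afw directly; a minimal sketch (the file path is illustrative, taken from the output above):

import lsst.afw.table as afwTable

# Read the output source catalog directly from disk (path illustrative).
cat = afwTable.SourceCatalog.readFits("src-0725289_49.fits")

# Footprint.isHeavy() reports whether per-source pixel data are attached.
nHeavy = sum(1 for rec in cat
             if rec.getFootprint() is not None and rec.getFootprint().isHeavy())
print(f"{nHeavy} of {len(cat)} records carry HeavyFootprints")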

Thanks very much.

Looks like this feature was removed in "Fix pep8 warnings" (lsst/pipe_tasks@1ff2a9c on GitHub) without comment or notice.

Actually, that’s unfair. That commit was merely removing a useless assignment to a variable that was already unused. The real removal was "Remove 'flags=sourceWriteFlags' from dataRef.put in CalibrateTask" (lsst/pipe_tasks@790c077 on GitHub).

That September 2016 commit gave the rationale “The Butler ‘put’ does not support a ‘flags’ option to pass down to the underlying catalogs.”

Is this a question that should be revisited with the Gen3 Butler?

There are ongoing discussions about the best way to handle this in Gen3. There is also DM-26761 (“Add flags parameter for reading afw tables in gen3” on Jira) for dealing with this on read (not write).
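
The read-side flags already exist at the afw level even though the butler doesn't expose them; a hedged sketch of reading a source table without its HeavyFootprint payload (check the readFits keyword/positional details against your installed afw version):

import lsst.afw.table as afwTable

# SOURCE_IO_NO_HEAVY_FOOTPRINTS skips the per-source pixel data on read;
# SOURCE_IO_NO_FOOTPRINTS would skip the footprints entirely.
cat = afwTable.SourceCatalog.readFits("src-0725289_49.fits",
                                      flags=afwTable.SOURCE_IO_NO_HEAVY_FOOTPRINTS)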

Gen3 does not allow people to specify parameters on put. It’s up to the user to strip it ahead of time if they don’t think it will be needed. See also the discussion in DM-6927.
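
To make “strip it ahead of time” concrete at the file level, the same I/O flags work on write when you bypass the butler; a sketch only, not an endorsed workflow (as with reading, check the writeFits signature against your afw version):

import lsst.afw.table as afwTable

# Hypothetical paths; substitute your own catalogs.
cat = afwTable.SourceCatalog.readFits("src-0725289_49.fits")

# Write without the HeavyFootprint pixel payload; ordinary Footprints
# (spans and peaks) are preserved in the output file.
cat.writeFits("src_no_heavy.fits", flags=afwTable.SOURCE_IO_NO_HEAVY_FOOTPRINTS)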

Thanks all for the very helpful responses!

As a mere end user, I would indeed like the ability to specify this type of flag and have its intent be obeyed/propagated, but I can’t claim to know all the other implications that might come along with it.

@timj could you elaborate on “strip it ahead of time” for Gen3 – does that mean I would be manipulating a Butler object such that it discards the things I don’t wish to write out before issuing the write command? Thanks again!

It could mean that you run the code to strip out the heavy footprints before writing the table, but ordinarily that would be happening either in the Task.run() method (controlled by configuration) or in the PipelineTask.runQuantum() infrastructure that interacts with the butler. Since you say you are a user and not a Task author, there’s little you can do at the present time.
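
As an illustration of what that in-memory stripping could look like inside task code, here is a sketch that swaps each HeavyFootprint for a plain Footprint; setFootprint and the Footprint constructor are real afw APIs, but the helper itself is hypothetical and whether/where it would run is up to the task author:

import lsst.afw.detection as afwDetection

def strip_heavy_footprints(catalog):
    """Replace each HeavyFootprint with a plain Footprint, in place.

    Keeps the span geometry and peaks; drops only the pixel payload.
    """
    for record in catalog:
        fp = record.getFootprint()
        if fp is not None and fp.isHeavy():
            light = afwDetection.Footprint(fp.spans)
            light.setPeakCatalog(fp.peaks)  # assumes setPeakCatalog is exposed in your afw
            record.setFootprint(light)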

v19 is an extremely old release of the software, and many, many things have changed (including the entire way we run pipelines and interact with data files). In this particular case we haven’t implemented any tasks that strip the heavy footprints out, so updating to a current release would not help you. On the other hand, migrating to the modern infrastructure would at least let you make use of any task configuration that does happen in the future.

Thanks, Tim! Yes, I am aware that v19 is quite old. The context is that I’m reducing DECam data, and my understanding has been that DECam support in Gen3 is currently under development, so I’m trying to figure out how to balance using something stable versus not getting ridiculously far behind. At any rate, I am very interested in learning to use Gen3 in the near future.

As far as I’m aware, DECam support in Gen3 is working great unless you need to use the community pipeline calibrations. Even if you want to keep using Gen2, you should be using v23.0.2. We are using DECam data with Gen3 in many of our alert production tests. @lskelvin or @mrawls can comment in case I’m missing something subtle.

See for example:

Thanks! Yes, using Lee’s guide to learn DECam Gen3 processing has been on my to-do list for a while. Thanks also for the specific recommendation about a more recent version of Gen2.

It is convenient to be able to use CP master cals with Gen2, but I’m alright with learning a different way to handle master cals for Gen3.

What is the status of custom reference catalogs in Gen3? Our group wants, for instance, to process DECam data below Dec = -30, where there is no PS1 coverage available. @lskelvin @mrawls

Reference catalogs aren’t really different between gen2/gen3. What reference catalog are you thinking of using? We have instructions for converting external catalogs to our refcat format. See this Community post for a Gaia DR2 refcat (though that doesn’t help with photometry).
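
For concreteness, the conversion is driven by a Python config file; a hedged sketch of what one might look like for a generic external catalog (the column names refer to hypothetical columns in your input files, and the exact config fields should be checked against lsst.meas.algorithms.ConvertReferenceCatalogConfig in your installed version):

# Sketch of a config file for converting an external catalog to refcat format.
config.dataset_config.ref_dataset_name = "my_southern_refcat"  # your choice of name
config.id_name = "source_id"      # hypothetical unique-ID column in the input
config.ra_name = "ra"             # hypothetical RA column (degrees)
config.dec_name = "dec"           # hypothetical Dec column (degrees)
config.mag_column_list = ["g", "r"]                       # bands to ingest
config.mag_err_column_map = {"g": "g_err", "r": "r_err"}  # matching error columns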

@parejkoj Thanks for the info! Our group has had good success with making LSST-pipeline-formatted reference catalogs for our Gen2 DECam reductions, so it’s nice to hear that reference catalogs don’t really change much between Gen2/Gen3. Examples of southern data sets we’ve used as reference catalogs with Gen2 are DECaPS (DECam Plane Survey), SkyMapper, and the NOIRLab Source Catalog. Thanks again.

I am currently running a Gen3 version of the LSST pipelines (v23_0_1) on DECam data, and I’m seeing what I think may be essentially the same failure of my specified config argument(s) to have any effect on which outputs get written/saved, though now in the context of postISRCCD and icExp outputs, not the src table outputs discussed previously.

I would like neither the postISRCCD nor the icExp outputs to be saved to disk after running CCD-level calibrations, by which I mean effectively the Gen3 equivalent of what used to be called processCcd.py in Gen2. Even better would be if postISRCCD and icExp never got written to disk to begin with. The combination of postISRCCD and icExp outputs roughly triples the CCD-level calibration step’s output data volume relative to what I was getting with Gen2 (where the output data volume was entirely dominated by calexp).

Specifically, the config options -c characterizeImage:doWrite=False and -c characterizeImage:doWriteExposure=False appear not to have any effect on what outputs get written/saved. For context, the full command I’m running looks like:

OUTPUT=DECam/runs/reduce_data_volume

LOGFILE=$LOGDIR/step1_skip_transformPreSourceTable-reduce_data_volume.log; \
date | tee $LOGFILE; \
pipetask --long-log run --register-dataset-types  \
-b $REPO --instrument lsst.obs.decam.DarkEnergyCamera \
-i $INPUT \
-o $OUTPUT \
-C calibrate:config/processCcd_g_config.py \
-c characterizeImage:doWrite=False \
-c characterizeImage:doWriteExposure=False \
-p DRP-Merian-step1-skip_transformPreSourceTable.yaml \
-d "instrument='DECam' AND $DATAQUERY" \
2>&1 | tee -a $LOGFILE; \
date | tee -a $LOGFILE

I’ve attached the associated input YAML pipeline definition file and input calibrate:config file. I have checked that these doWrite and doWriteExposure config parameter settings from my launch command are indeed getting propagated into their associated output .py config files.

Beyond setting characterizeImage:doWrite=False and characterizeImage:doWriteExposure=False, I also tried adding in -c isr:doWrite=False, but then this crashed right from the outset with:

RuntimeError: QuantumGraph is empty.

I do not understand why writing versus not writing the ISR results would affect the QuantumGraph; I would think that ISR results could simply be held in memory during the other calibration steps for each CCD? Or is such holding of data in memory not allowed because, e.g., isr and characterizeImage are distinct Tasks within this processing framework?

As an alternative, I have been exploring butler prune-datasets, but I have some questions about that as well, which I will plan on splitting off into a separate, new forum topic. Thanks very much.
DRP-Merian-step1-skip_transformPreSourceTable.yaml (1.6 KB)
processCcd_g_config.py (359 Bytes)

There is a pipetask --log-file option that can write you a log file automatically without having to capture stdout. You can select a .txt file for the normal output or .json if you want something that is easy to parse later.

If you don’t write the outputs then those outputs do not exist, and so any graph building for downstream tasks that need those products will not be possible. In Gen2 there was a single processCcd script that did all the processing in memory and chained the run methods together explicitly. This is fundamentally different from how Gen3 operates, where each “quantum” is a discrete entity that can be run either in a single process (if you are calling it from pipetask without -j) or as discrete processes when executed by bps.
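
To make the quantum boundary concrete, here is a schematic (not taken from the actual pipeline source) of how a Gen3 PipelineTask is wired; every quantum gets its inputs from the butler and puts its outputs back, which is why a downstream quantum cannot even be added to the graph if an upstream output is never written. Dataset type names and dimensions below are illustrative:

import lsst.pipe.base as pipeBase
import lsst.pipe.base.connectionTypes as cT


class ExampleConnections(pipeBase.PipelineTaskConnections,
                         dimensions=("instrument", "visit", "detector")):
    inputExposure = cT.Input(
        name="postISRCCD",
        storageClass="Exposure",
        dimensions=("instrument", "visit", "detector"),
        doc="ISR-corrected exposure, read from the butler.",
    )
    outputExposure = cT.Output(
        name="icExp",
        storageClass="Exposure",
        dimensions=("instrument", "visit", "detector"),
        doc="Characterized exposure, written back to the butler.",
    )


class ExampleConfig(pipeBase.PipelineTaskConfig,
                    pipelineConnections=ExampleConnections):
    pass


class ExampleTask(pipeBase.PipelineTask):
    ConfigClass = ExampleConfig
    _DefaultName = "example"

    def runQuantum(self, butlerQC, inputRefs, outputRefs):
        # Every quantum reads its inputs from the butler...
        inputs = butlerQC.get(inputRefs)
        outputs = self.run(**inputs)
        # ...and writes its outputs back to the butler. There is no
        # in-process hand-off between tasks as in Gen2's processCcd.py.
        butlerQC.put(outputs, outputRefs)

    def run(self, inputExposure):
        # Real work would happen here.
        return pipeBase.Struct(outputExposure=inputExposure)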

Intermediates always have to be written to a butler, and the next quantum always has to read from a butler. The design of Gen3 leaves what happens to that intermediate up to the butler datastore. In theory you can set up your datastore to include an in-memory variant that only accepts the intermediates, but this is a bit tricky to get working properly since you can only use it for quanta that share the same process memory. You can also configure a chained datastore that has one datastore writing to scratch space that only accepts the intermediates. We would still write entries to the registry because of provenance. This isn’t working perfectly yet because we haven’t tied in a proper provenance system that can determine that some intermediates can be dropped from a temporary datastore when the processing completes (the registry doesn’t know that your scratch-space datastore is temporary).

Thanks, Tim! I will look into pipetask --log-file, as that definitely sounds relevant/interesting to me.