Reference catalogs, camFluxes, and colorterms

parejkoj · February 4, 2019, 7:45pm

After some discussion with @rowen, @erykoff, @KSK, and @jbosch, I think it’s time to start sketching out a new reference catalog interface. This conversation was spawned in part by RFC-535 when we realized that the *_camFlux fields in refcats were in fact currently being used as aliases, and in part by my annoyance at how we manage colorterm corrections to refcat fluxes.

Present state

Currently, the filter map (“use PS1 z for LSST y”) and colorterm (“take this combination of PS1 i and z to get a more correct LSST z”) corrections have to be configured and applied inside the tasks that use them (currently PhotoCalTask and jointcal, soon to be fgcmcal). This results in duplication of task configurations. Also, if a non-Task user loads a reference catalog, there are non-trivial steps required to get the “correct” fluxes for their desired filters, although DM-13054 somewhat improves the situation. As of DM-13054, getting the most correct available fluxes requires finding and loading a ColortermLibrary for the camera of interest, creating a Colorterm object from it, calling colorterm.getCorrectedMagnitudes(), and passing those magnitudes around with the loaded refcat.
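For readers unfamiliar with the two corrections, here is a minimal standalone sketch of what they compute. It does not use the LSST stack: `FILTER_MAP`, `ToyColorterm`, and the coefficient values are made up for illustration, and the polynomial-in-color form is an assumption about how a colorterm is parameterized, not a copy of the real Colorterm class.

```python
from dataclasses import dataclass

# Hypothetical filter map: camera filter -> reference catalog filter.
FILTER_MAP = {"y": "z"}  # "use PS1 z for LSST y"


@dataclass(frozen=True)
class ToyColorterm:
    """Illustrative colorterm: a polynomial in one reference-catalog color."""

    primary: str     # refcat band closest to the camera band
    secondary: str   # second refcat band used to form the color
    c0: float = 0.0
    c1: float = 0.0
    c2: float = 0.0

    def corrected_mag(self, ref_mags):
        """Predict the camera-band magnitude from refcat magnitudes."""
        p = ref_mags[self.primary]
        color = p - ref_mags[self.secondary]
        return p + self.c0 + self.c1 * color + self.c2 * color ** 2


# Made-up coefficients, not taken from any real ColortermLibrary.
z_term = ToyColorterm(primary="z", secondary="i", c0=0.01, c1=-0.05)
ref_mags = {"i": 18.30, "z": 18.10}

lsst_z = z_term.corrected_mag(ref_mags)   # colorterm correction
lsst_y = ref_mags[FILTER_MAP["y"]]        # plain filter-map substitution
```

The point of the sketch is only that both corrections are small, per-object functions of the refcat magnitudes; the pain described above comes from where they are configured and applied, not from the arithmetic.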

Sketch of a new system

My vision is that when a user loads a reference catalog (via either loadSkyCircle() or loadPixelBox()), the resulting in-memory catalog has all relevant corrections applied to it. That way, the user can get at the appropriate fluxes for their filters of interest without performing further transformations on the reference catalog, or having to know anything about the reference catalog fluxes themselves. This would not change the on-disk reference catalog representation.

This would require either that 1) the LoadReferenceObjectsTask be instantiated with the camera configuration required to correct it, or 2) that an external method/Task be called with the loaded refcat and the camera configuration to correct the refcat to that camera. We are currently setting *_camFlux aliases as part of the filterMap; we could re-purpose those (new name suggestions welcome, though) to be actual fields that contain the corrected fluxes. I don’t know how compatible option 1) is with the new gen3 reference objects task, but it is certainly the most straightforward option from a refcat user’s perspective.
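To make option 2) concrete, here is a rough sketch of an external function that takes an already-loaded catalog plus a camera configuration and fills real `<band>_camFlux` fields. Everything here is hypothetical: the catalog is a plain list of dicts rather than an afw table, `apply_camera_corrections` is an invented name, and the colorterm is reduced to a linear form with made-up coefficients.

```python
import math


def apply_camera_corrections(refcat, filter_map, colorterms=None):
    """Return a copy of `refcat` with one `<band>_camFlux` entry per camera band.

    refcat:     list of dicts holding positive `<refband>_flux` values.
    filter_map: camera band -> refcat band, used when no colorterm exists.
    colorterms: camera band -> (primary, secondary, c0, c1), hypothetical layout.
    """
    colorterms = colorterms or {}
    corrected = []
    for row in refcat:
        new_row = dict(row)  # leave the loaded catalog untouched
        for cam_band, ref_band in filter_map.items():
            flux = row[f"{ref_band}_flux"]
            if cam_band in colorterms:
                primary, secondary, c0, c1 = colorterms[cam_band]
                f_p = row[f"{primary}_flux"]
                f_s = row[f"{secondary}_flux"]
                # Color in magnitudes (primary - secondary); the zero point cancels.
                color = -2.5 * math.log10(f_p / f_s)
                flux = f_p * 10 ** (-0.4 * (c0 + c1 * color))
            new_row[f"{cam_band}_camFlux"] = flux
        corrected.append(new_row)
    return corrected


refcat = [{"i_flux": 1000.0, "z_flux": 1200.0}]
out = apply_camera_corrections(
    refcat,
    filter_map={"z": "z", "y": "z"},
    colorterms={"z": ("z", "i", 0.0, -0.05)},  # made-up coefficients
)
```

Under option 1) the same function would simply be called inside the loader before the catalog is returned; the on-disk representation is untouched either way.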

I know that @jbosch has plans for further SED and Transmission corrections, and I am curious how those ideas line up with this. Certainly, ensuring that the loaded refcat has all relevant corrections applied would make it easier for code to immediately make use of future advanced corrections.

Performance questions

Applying all necessary colorterm and other corrections on reference catalog load does increase the compute requirements when loading a refcat. However, refcat loading is a small fraction of our compute time, and these calculations are an even smaller part of that: my recent changes to colorterms in DM-13054 sped up the colorterm calculation by ~20%, but that change was basically immeasurable in tests of the reference catalog handling part of PhotoCalTask. For ap_pipe, we can pre-load the reference catalogs, so it should not matter to the 60 second budget.

Timeline?

The DRP team has already produced a new ReferenceObjectLoader as part of their gen3 work: It would be good to nail down a new API before that code goes into full production, even if all of the above features are not yet in place. We could then add transmission and SED corrections to it as desired in the future.

erykoff · February 4, 2019, 8:34pm

I agree that something must be done! And right now, to be clear, fgcmcal does also support the color term library in the final comparison to a given reference catalog, via PhotoCalTask (so with the same problem of the duplication of task configurations). Soon, this will be brought within fgcmcal directly, but it won’t increase the number of places these are used, just move one of them a little bit.

Anyway, whatever we do for a new API, I want to make sure it’s compatible with what Jim has in mind with transmission curves. Because what is a color term? It’s a very simple approximation for how a star of a given SED will look through two different transmission curves. Thankfully, the stellar locus is simple so you can indeed get close to selecting a stellar SED just by taking a single color (say, g-i), at least at high Galactic latitude where reddening is very small. (These terms are not appropriate for galaxies, SNe, etc, because they don’t have the same SEDs as assumed in the color terms).
Color terms work fine at the percent level. I’m actually not sure how far you can stretch things at the half-percent level (the closer your two systems are matched, of course, the simpler it is, so this is a significant consideration). Below this (beyond LSST spec, but desired for DESC cosmology), the simplicity described in a “color term” is definitely insufficient.
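The "approximation" can be seen in a toy calculation. The sketch below is an illustration under strong assumptions, not a measurement: the SEDs are pure blackbodies, the three bandpasses are invented Gaussians (two reference bands and a slightly redder camera band), and the magnitudes have an arbitrary common zero point. It computes synthetic magnitudes, then fits a straight line to the camera-minus-reference offset as a function of the reference color, which is what a linear color term is.

```python
import math

HC_OVER_K_NM = 1.4388e7  # h*c/k in nm*K


def gaussian_band(center_nm, sigma_nm=40.0):
    """Toy transmission curve; not a real filter."""
    return lambda lam: math.exp(-0.5 * ((lam - center_nm) / sigma_nm) ** 2)


def blackbody_fnu(temp_k):
    """F_nu shape of a blackbody, arbitrary normalization, as a function of wavelength."""
    return lambda lam: lam ** -3 / math.expm1(HC_OVER_K_NM / (lam * temp_k))


def synth_mag(sed, band, lam_grid):
    """-2.5 log10 of the band-averaged F_nu, weighting by S(lambda)/lambda.

    The grid is uniform, so the wavelength step cancels in the ratio.
    """
    num = sum(sed(lam) * band(lam) / lam for lam in lam_grid)
    den = sum(band(lam) / lam for lam in lam_grid)
    return -2.5 * math.log10(num / den)


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


lam_grid = [500.0 + i for i in range(701)]  # 500..1200 nm in 1 nm steps
ref_i, ref_z, cam_z = gaussian_band(750.0), gaussian_band(870.0), gaussian_band(890.0)

colors, offsets = [], []
for temp in range(3500, 8001, 500):
    sed = blackbody_fnu(float(temp))
    m_i, m_z, m_cam = (synth_mag(sed, b, lam_grid) for b in (ref_i, ref_z, cam_z))
    colors.append(m_i - m_z)      # reference-catalog color
    offsets.append(m_cam - m_z)   # what the color term has to predict

slope, intercept = fit_line(colors, offsets)
max_resid = max(abs(o - (intercept + slope * c)) for c, o in zip(colors, offsets))
print(f"color term slope = {slope:.4f}, worst residual = {max_resid * 1000:.2f} mmag")
```

For smooth SEDs and closely matched bands the residual from the straight line is small, which is why color terms hold up at the percent level. Real stellar SEDs with absorption features, reddening, or non-stellar objects break the one-color-picks-one-SED assumption, and the sketch says nothing about how large those residuals become.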

ivezic · February 4, 2019, 9:05pm

I agree with Eli: please let’s not assume linear relationships with colors!!! We need to use various integrals of the so-called phi function. Please refer to the SRD, LSE-180 and LSE-40 for more details, and do not hesitate to ask for clarifications if technical details are confusing!
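For readers without those documents to hand, the phi function is the normalized system response. A sketch of the definitions as they are usually written (the notation here may differ from the SRD and LSE-180, which remain the authoritative sources; $S_b$ is the total system throughput in band $b$, and $F_\nu$ the object's specific flux at the top of the atmosphere):

```latex
\phi_b(\lambda) = \frac{S_b(\lambda)\,\lambda^{-1}}
                       {\int_0^\infty S_b(\lambda')\,\lambda'^{-1}\,d\lambda'},
\qquad
F_b = \int_0^\infty F_\nu(\lambda)\,\phi_b(\lambda)\,d\lambda,
\qquad
m_b = -2.5\,\log_{10}\!\left(\frac{F_b}{F_{\rm AB}}\right),
\quad F_{\rm AB} = 3631\ \mathrm{Jy}.
```

Converting between two systems therefore means comparing $\int F_\nu\,\phi_b\,d\lambda$ for two different $\phi_b$, which depends on the full SED and is only approximately a linear function of one color.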
Whatever decisions you make that might have impact on science: please do communicate them back to the DM Subsystem Science Team and Leanne Guy! They are supposed to keep me and the Project Science Team fully informed about such decisions!

jbosch · February 4, 2019, 10:30pm

I do not think it should be the responsibility of spatially-aware load-and-filter code to deal with bandpass and color issues, which I think is implied by having these transformations done automatically by both loadSkyCircle() and loadPixelBox(); I’d much prefer to see code that needs reference catalogs explicitly load and filter them in one step, and then adapt them (if adapting is even possible; more on that below) in another step.

The reasoning is that I want to design for maintainability and extensibility first, and only worry about user convenience later - spatially-aware loading and bandpass/chromaticity correction are independently tricky problems that will be each changing significantly over the next few months for different reasons, and I want to make sure any redesign has a strong separation of concerns to keep those changes out of each other’s way. Adding user convenience code to reduce the boilerplate involved in typical usage is easier to do well later, once we’ve seen how much typical usage cases actually have in common.

Quoting parejkoj: “The DRP team has already produced a new ReferenceObjectLoader as part of their gen3 work: It would be good to nail down a new API before that code goes into full production, even if all of the above features are not yet in place.”

I’m afraid I was thinking the opposite. It’d be good to get @natelust’s thoughts on this, but I think the changes we’ve done so far to make a ReferenceObjectLoader usable with PipelineTask put us into an ugly hybrid state that will be easiest to clean up after we can retire CmdLineTasks entirely.

That said, if we do need to do the colorterms (etc) refactor now (and can identify effort to do it), it definitely could clean that mess up; the messiness all stems from the lack of separation of concerns I noted above. Essentially, Gen3 has a totally different approach to the spatial filtering, without any need (thus far) to change the schema remapping logic. Given the degree to which those are combined in the current tasks, it was really hard to separate those and we ended up duplicating some of the schema remapping logic (and certainly making it harder than I’d like to follow). If we were able to separate any schema remapping (or similar) for bandpass adaptation from spatial filtering in Gen2 as part of this refactor, the Gen3 implementation would fit much better.

In any case, as both @erykoff and @ivezic noted, the real challenge here is that we don’t want to just go from assuming we can pick a valid reference catalog filter to always using color terms; we want (in the not very distant future) to be able to go all the way towards using per-object SEDs and transmission curves to do color corrections, and any significant refactor we do now needs to take that into account even if it doesn’t take us all the way there. In this paradigm, there isn’t actually any kind of bandpass adaptation that can be done on a full reference catalog; all that can be done are per-object adaptations (or alternatively per-match adaptations).

There are some moderately big prerequisites to getting all the way there:

1) We need to start inferring SEDs for detected objects (or at least detected objects with reference catalog matches); this involves identifying a way to do that, determining when we can do it in the pipelines (note that it requires full-color information), and how we want to represent SEDs in whatever catalogs they appear in.
2) We need to define some normalization conventions for how TransmissionCurves relate to PhotoCalibs (I think this is straightforward, but I haven’t thought enough about it to be sure).
3) We need to have a way to retrieve TransmissionCurves (or the same information via other means - more phi functions?) for all of the reference catalogs we care about.

I don’t want to block this refactor on actually meeting those prerequisites; I agree with @parejkoj and @erykoff that things are in a bad state now. But I think we need to go carefully through the thought experiment of how reference catalogs would work with SEDs, TransmissionCurves, and phi functions in the future, and be convinced any design involving colorterms that we go towards now has a straightforward way to be evolved into something more sophisticated.
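One way to run that thought experiment is to write down a per-match interface and check that today's filter map and colorterms fit it without closing the door on SEDs. The sketch below is purely hypothetical: `BandpassAdapter`, `FilterMapAdapter`, `ColortermAdapter`, and `adapt_matches` are invented names, the colorterm is linear with made-up coefficients, and the SED argument is an opaque placeholder because no SED representation has been chosen.

```python
from typing import Mapping, Optional, Protocol, Sequence


class BandpassAdapter(Protocol):
    """Hypothetical interface: predict a camera-band magnitude for one match."""

    def adapt(self, ref_mags: Mapping[str, float],
              source_sed: Optional[Sequence[float]] = None) -> float:
        ...


class FilterMapAdapter:
    """Simplest case: use one refcat band as-is."""

    def __init__(self, ref_band: str):
        self.ref_band = ref_band

    def adapt(self, ref_mags, source_sed=None):
        return ref_mags[self.ref_band]


class ColortermAdapter:
    """Linear colorterm; ignores any SED information it is given."""

    def __init__(self, primary: str, secondary: str, c0: float, c1: float):
        self.primary, self.secondary = primary, secondary
        self.c0, self.c1 = c0, c1

    def adapt(self, ref_mags, source_sed=None):
        p = ref_mags[self.primary]
        color = p - ref_mags[self.secondary]
        return p + self.c0 + self.c1 * color


def adapt_matches(matches, adapter: BandpassAdapter):
    """Apply `adapter` independently to each (ref_mags, source_sed) match."""
    return [adapter.adapt(ref_mags, sed) for ref_mags, sed in matches]


matches = [({"i": 18.30, "z": 18.10}, None)]  # no SED available yet
as_is = adapt_matches(matches, FilterMapAdapter("z"))
with_colorterm = adapt_matches(matches, ColortermAdapter("z", "i", 0.01, -0.05))
```

A later SED-based adapter would implement the same `adapt` signature but actually use `source_sed` together with the two transmission curves; callers that loop over matches would not need to change. Whether per-match is the right granularity is exactly the design question raised above.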

Unfortunately, my ability to spend significant time thinking about this is also essentially blocked by the middleware transition, so I think I’ll just have to play reviewer, and @erykoff’s phrase “what Jim has in mind with transmission curves” implies much more thought on my part than has actually happened.

natelust · February 5, 2019, 4:26pm

I just want to jump in and say that I agree with Jim in that I hope the current implementation of ReferenceObjectLoader does not live to see commissioning. Much of the awkwardness stemmed from the need to not change much existing code, leading to a weird dual system. At minimum, it would be nice to leverage the new gen3 middleware for loading the catalogs with initial filtering, take a function to filter further (as is done now), and take an additional user-supplied function to handle any catalog mutations. It would work equally well, though, to have these as separate steps a task may perform on some input reference catalog, as they are logically separate things. The exception would be if region information is needed for transforming the catalogs (say, looking up photometric information for some region), in which case I guess one stage would make more sense.
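As a rough illustration of that shape (not the actual ReferenceObjectLoader API), the sketch below takes pre-filtered shards, an optional row filter, and an optional mutation function. All names are hypothetical, the "middleware" is just a list of lists of dicts, and the RA cut stands in for real spatial filtering.

```python
from typing import Callable, Iterable, List, Optional

Catalog = List[dict]


def load_reference_catalog(
    shards: Iterable[Catalog],
    keep: Optional[Callable[[dict], bool]] = None,
    mutate: Optional[Callable[[Catalog], Catalog]] = None,
) -> Catalog:
    """Concatenate shards, filter rows further, then apply catalog mutations.

    `shards` stands in for whatever the middleware returns after its own
    initial (coarse) spatial filtering.
    """
    rows = [row for shard in shards for row in shard
            if keep is None or keep(row)]
    return mutate(rows) if mutate is not None else rows


# Stand-in for two shards handed back by the middleware.
shards = [
    [{"ra": 10.0, "dec": 0.1, "z_flux": 1200.0}],
    [{"ra": 10.4, "dec": -0.2, "z_flux": 800.0}],
]


def in_box(row):
    """Finer filter, here a simple RA cut."""
    return 9.9 <= row["ra"] <= 10.2


def add_cam_flux(catalog):
    """Example mutation: fill a camera-band flux from a refcat band."""
    return [dict(row, y_camFlux=row["z_flux"]) for row in catalog]


refcat = load_reference_catalog(shards, keep=in_box, mutate=add_cam_flux)
```

The same two callables could equally be applied as separate steps by the calling task, since neither depends on how the shards were obtained; a single stage only becomes necessary if the mutation needs the region itself.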

I would note that the current implementation is in between the two systems, creating the loader at a top task level, but passing it down to sub tasks to use as if they were responsible for loading. It would probably make sense with the new system (when we can retire the old) to separate concerns entirely. Tasks such as astrometry and photo cal should just be given a catalog to operate on, and have no knowledge of or concern with how they got it, simplifying the concerns of those tasks. The loading and configuring of datasets should then be moved as close to getting the datasets off “disk” as possible.