New use of `isPrimary` that began in `w_2021_12`

This post is a bit late, as the relevant code was introduced almost 2 months ago, but it was brought to my attention that there is some confusion over which flags to use in order to obtain a catalog of unique sources. First I’ll give a bit of a backstory as to why this changed, followed by a description of the new flags and how they should be interpreted.

The input mergeDet catalog to the deblender contains a list of parent sources, each row consisting of information about the parent, a footprint (a boolean mask of pixels in the input image that were detected as part of the parent blend), and a peak catalog (a list of peak locations and central flux values for all detected peaks in the parent blend).

Single-band deblender

Prior to the adoption of scarlet as the default deblender, all deblending was done with meas_deblender.SourceDeblendTask, which is still the single visit (single band) deblender. The basic idea of the algorithm is that most galaxies roughly exhibit 180 degree symmetry and (at shallow depths) are only blended with one other object in most cases. So a symmetric template (numerically equivalent to x = np.minimum(x, x[::-1]) in one dimension) is made for all of the sources in a blend that do not fit the PSF, and the flux in the image is re-apportioned to each source based on the ratio of the templates for each pixel. This algorithm can fail for non-symmetric sources and for any blend where a source has neighbors on both sides, causing inaccuracies in the measurement of all three sources. As the depth of the images increases, blending becomes more severe and the number of cases where this algorithm fails increases, which is why a different deblender is used for co-added images in multiple bands.
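To make the template idea concrete, here is a toy 1D sketch of symmetric templates and flux re-apportioning. It is not the meas_deblender implementation: the function name symmetric_template, the pixel values, and the peak positions are all made up for illustration, and the real code works on 2D footprints and handles PSF-like sources separately.

```python
import numpy as np

def symmetric_template(image, peak):
    """Toy template that is 180-degree symmetric about index `peak` (1D only)."""
    n = len(image)
    idx = np.arange(n)
    mirror = 2 * peak - idx               # each pixel reflected through the peak
    valid = (mirror >= 0) & (mirror < n)  # reflected pixel must fall inside the image
    template = np.zeros(n, dtype=float)
    # Keep the smaller of a pixel and its mirror image, which enforces symmetry.
    template[valid] = np.minimum(image[idx[valid]], image[mirror[valid]])
    return template

# Hypothetical blended profile with two peaks (at indices 2 and 5).
image = np.array([0.0, 1.0, 4.0, 2.0, 3.0, 6.0, 2.0, 0.0])
peaks = [2, 5]

templates = np.array([symmetric_template(image, p) for p in peaks])
total = templates.sum(axis=0)

# Weight the image by each template's share of the summed templates.
weights = np.divide(templates, total, out=np.zeros_like(templates), where=total > 0)
children = weights * image

# Wherever at least one template is non-zero, the children sum back to the image.
print(children.sum(axis=0))
```

Because the templates are only used as weights, the flux is conserved in every pixel covered by at least one template, which is the property the next paragraph relies on.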

Because the templates generated by the deblender are used to weight the flux from the image, the total flux in the footprints is conserved. This means that for isolated sources, the template that would be created by SourceDeblendTask is irrelevant, as all of the flux in the footprint would be returned. The result is an output catalog (in each band) with all of the parent blends at the top followed by all of the children deblended from one of the parents. So when this was the only deblender, selecting a set of unique sources was easy: you just cut on deblend_nChild == 0. This selected all of the isolated sources (from the parent section) and all of the deblended child sources.
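As a minimal sketch of that older cut, assume a single-band catalog held as plain NumPy arrays. The ids and values below are invented; only the column names come from this post.

```python
import numpy as np

# Toy single-band catalog: parents first (ids 1 and 2), then the two children
# deblended from parent 2. Parent 1 is isolated, so it has no children.
catalog = {
    "id": np.array([1, 2, 3, 4]),
    "parent": np.array([0, 0, 2, 2]),
    "deblend_nChild": np.array([0, 2, 0, 0]),
}

# Isolated parents and deblended children both have no children of their own.
unique = catalog["deblend_nChild"] == 0
unique_ids = catalog["id"][unique]
print(unique_ids)
```

The blended parent (id 2) drops out and each astrophysical source appears once.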

Multi-band deblender (scarlet)

meas_extensions_scarlet.ScarletDeblendTask is different, as it uses scarlet to create a model for each source in a blend. The result is a philosophically different object, as there is no longer an assumption that all of the flux in the input image will be modeled by one of the children in the blend (this may change in the near future, but this is the current implementation). The results of the scarlet deblender will be biased by the assumptions that went into making the models, so it was decided that it would be a good idea to (by default) also model all of the isolated sources. This allows scarlet models of isolated sources to be compared with the un-modeled isolated source measurements to investigate the biases that scarlet is introducing, and it also gives users the option to choose between the un-modeled (parent) isolated source records and the scarlet model version of each isolated source. However, this flexibility forced a change in the way that we select unique objects in a source catalog.

Flags set by the deblender

Before we get into the flags set by pipe_tasks it is useful to understand the flags that are set in SourceDeblendTask and ScarletDeblendTask that relate to source selection.

  • parent: the id in the catalog for the parent of this source record. This is actually set pre-deblender, where all top level records have parent=0.
  • deblend_nPeaks: the number of peaks contained in the source's footprint.
  • deblend_nChild: the number of peaks deblended by the deblender from this source and created as new source records in the catalog. This is different from deblend_nPeaks in that isolated sources that are not deblended by SourceDeblendTask and child peaks that were culled during deblending are not included in this count.
  • deblend_parentNPeaks: The number of peaks contained in the parent of this source record.
  • deblend_parentNChild: the number of children deblended from the parent of this source record.

isPrimary and other flags added in pipe_tasks

In addition to source records for deblended parents and multiple entries for isolated sources, output catalogs are also not unique because they may contain “pseudo” sources (e.g. sky objects that have been added to assist with calibration but are not output sources) and, if the analysis is done over multiple patches and/or tracts, sources in the overlap region can exist in multiple overlapping patches (but always on the interior of only one). For this reason the pipe_tasks.SetPrimaryFlagsTask sets a number of useful flags to assist users in determining a unique output catalog for their analysis.

detect_isPatchInner and detect_isTractInner

True when:

  • A source is in the inner region of a patch
  • A source is in the inner region of a tract

Details

The detect_isPatchInner and detect_isTractInner flags are used to identify sources that are contained in the interior region of a patch (and tract). By definition every point in the sky is located on the interior of exactly one patch and tract, however patches and tracts also include an outer region that overlaps with their neighbors. Sources with a False value for either flag are in the overlap region and will show up multiple times in a combined catalog. In practice it would be useful to have a more clever algorithm for choosing which source to use on the edge of a patch/tract, since some sources will be cut off, however these flags give a quick way to ensure that a catalog using multiple tracts/patches is unique. So an easy way to get unique sources is to select all of the sources with detect_isPatchInner==True & detect_isTractInner==True.
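A small illustration of why the inner flags remove duplicates, using invented values. The obj labels are hypothetical and exist only to show which rows describe the same object on the sky; a real catalog has no such column.

```python
import numpy as np

# Rows from two neighboring patches stacked into one combined catalog.
# Object "B" sits in the overlap, so it was measured in both patches.
obj = np.array(["A", "B", "B", "C"])
patch = np.array([0, 0, 1, 1])
detect_isPatchInner = np.array([True, True, False, True])
detect_isTractInner = np.array([True, True, True, True])

# Keep only rows measured on the interior of both their patch and their tract.
inner = detect_isPatchInner & detect_isTractInner
print(obj[inner], patch[inner])
```

Object "B" survives only from patch 0, where it lies in the inner region.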

sky_source and merge_peak_sky

True when:

  • A source is flagged as a sky_source in a single visit catalog

or

  • A source is flagged as merge_peak_sky in a mergeDet coadd catalog.

Details

sky_source is a flag in a single visit catalog to mark sky objects while merge_peak_sky is the coadd version (which states that a source was a sky object in at least one band). Any sources with either of these flags set should be ignored in a final source catalog as they are not astrophysical objects.

detect_isIsolated

True when:

  • A source only has a single peak (deblend_nPeaks == 1)
  • A source is a top level parent (parent == 0) or its parent only had a single peak (deblend_parentNPeaks == 1)

Details

The detect_isIsolated flag marks sources that are not contained in a blend. This covers both isolated sources that are not modeled by the deblender (parents) and (in cases where the multi-band deblender is used) scarlet models of the isolated sources. Note that cutting on this flag will not give a unique set of sources, but can be useful for selecting all of the isolated sources to analyze the differences between measurements made on scarlet models and measurements made on the same isolated sources.
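The two conditions above can be sketched on a toy multi-band catalog. All values are invented, and the value stored in deblend_parentNPeaks for top-level records (0 here) is an assumption made for illustration.

```python
import numpy as np

# ids 1 and 2 are top-level parents; id 3 is the scarlet model of isolated
# parent 1; ids 4 and 5 were deblended from the two-peak parent 2.
ids = np.array([1, 2, 3, 4, 5])
parent = np.array([0, 0, 1, 2, 2])
deblend_nPeaks = np.array([1, 2, 1, 1, 1])
# Assumption: top-level records carry 0 in this column.
deblend_parentNPeaks = np.array([0, 0, 1, 2, 2])

# Single peak, and either a top-level record or the child of a single-peak parent.
is_isolated = (deblend_nPeaks == 1) & ((parent == 0) | (deblend_parentNPeaks == 1))
print(ids[is_isolated])
```

Both the un-modeled parent (id 1) and its scarlet model (id 3) are selected, which is why this flag alone does not give a unique catalog.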

detect_fromBlend

True when:

  • A source is deblended from a parent that had multiple children (deblend_parentNChild > 1)

Details

The detect_fromBlend flag is used to mark sources that were deblended from a parent that contained multiple children. This is not the opposite of detect_isIsolated because it does not contain parents that were deblended into multiple sources.

detect_isDeblendedSource

True when:

  • The source is a top level parent and it is isolated (detect_isIsolated & parent==0)

or

  • The source was deblended from a parent with multiple children and has no children of its own (detect_fromBlend & deblend_nPeaks == 1)

Details

Current testing shows that the un-modeled isolated source measurements perform (perhaps unsurprisingly) better than the scarlet models of isolated sources in most cases, so the default set of unique sources uses the un-modeled (parent) isolated sources and scarlet models for sources in blends with multiple children. These sources are identified using the detect_isDeblendedSource flag, which is equivalent to (detect_isIsolated & parent==0) | (detect_fromBlend & deblend_nPeaks == 1). Checking that deblended sources only have a single peak in their footprints allows for potential hierarchical deblending in the future, where there may be several different hierarchies of deblended sources.
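The boolean expression can be checked on a toy multi-band catalog. The values are invented, and the 0 stored in the deblend_parent* columns for top-level records is an assumption for illustration only.

```python
import numpy as np

# ids 1 and 2 are top-level parents; id 3 is the scarlet model of isolated
# parent 1; ids 4 and 5 were deblended from the two-peak parent 2.
ids = np.array([1, 2, 3, 4, 5])
parent = np.array([0, 0, 1, 2, 2])
deblend_nPeaks = np.array([1, 2, 1, 1, 1])
deblend_parentNPeaks = np.array([0, 0, 1, 2, 2])  # assumed 0 for top-level records
deblend_parentNChild = np.array([0, 0, 1, 2, 2])  # assumed 0 for top-level records

is_isolated = (deblend_nPeaks == 1) & ((parent == 0) | (deblend_parentNPeaks == 1))
from_blend = deblend_parentNChild > 1

# Un-modeled isolated parents, plus single-peak children of real blends.
is_deblended_source = (is_isolated & (parent == 0)) | (from_blend & (deblend_nPeaks == 1))
print(ids[is_deblended_source])
```

The blended parent (id 2) and the scarlet model of the isolated source (id 3) are excluded, so each source is counted once.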

detect_isDeblendedModelSource

True when:

  • The source is not a top level parent (parent != 0)
  • The source does not have any children (deblend_nPeaks == 1)

Details

The detect_isDeblendedModelSource flag only exists when the multiband deblender is used, marking sources that were deblended from a parent. This includes both isolated sources that were modeled by scarlet and sources deblended from a parent with multiple child peaks. If your preference is to always use the scarlet model to ensure that the isolated and deblended sources have the same underlying models, then selecting on detect_isDeblendedModelSource & detect_isPatchInner & detect_isTractInner & ~merge_peak_sky will give a unique set of sources that is the equivalent of detect_isPrimary, only using the scarlet isolated models as opposed to the un-modeled isolated source records.

detect_isPrimary

True when:

  • A source is located on the interior of a patch and tract (detect_isPatchInner & detect_isTractInner)
  • A source is not a sky object (~merge_peak_sky for coadds or ~sky_source for single visits)
  • A source is either an isolated parent that is un-modeled or deblended from a parent with multiple children (detect_isDeblendedSource)

Details

The detect_isPrimary flag can be thought of as a flag to include the most common catalog of unique sources that users will want to make measurements on. However it is advised that users understand the assumptions made in using sources marked with this flag and whether or not it suits their needs.
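As a rough sketch of how the pieces combine for a coadd catalog, the following uses invented flag values to build the detect_isPrimary selection and, for comparison, the scarlet-model alternative described under detect_isDeblendedModelSource.

```python
import numpy as np

# ids 1-5: isolated parent, blended parent, scarlet model of the isolated
# source, and two deblended children. id 6 is a sky object.
ids = np.array([1, 2, 3, 4, 5, 6])
detect_isPatchInner = np.array([True, True, True, True, False, True])
detect_isTractInner = np.array([True, True, True, True, True, True])
merge_peak_sky = np.array([False, False, False, False, False, True])
detect_isDeblendedSource = np.array([True, False, False, True, True, True])
detect_isDeblendedModelSource = np.array([False, False, True, True, True, False])

# Cuts shared by both selections: inner region only, no sky objects.
common = detect_isPatchInner & detect_isTractInner & ~merge_peak_sky

is_primary = common & detect_isDeblendedSource          # un-modeled isolated sources
model_primary = common & detect_isDeblendedModelSource  # scarlet models throughout
print(ids[is_primary], ids[model_primary])
```

The two selections contain the same number of sources and differ only in which record represents the isolated source (id 1 versus id 3).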

Thanks, Fred - this is really helpful! I have a couple of initial questions (and need some more time to digest the details):

Does this mean to say to select sources with these flags set to True, or am I misunderstanding?

Is this a change from previous behavior for detect_isPrimary? I think that in the past this flag did not completely ignore the overlap regions. If it really does leave out overlap regions, then could you clarify which flag(s) one would use to select “the best measurement of all well-measured sources?” I recognize that this is vague/ambiguous, but am just thinking about what a typical user might select as a “clean” catalog for their science.

Does this mean to say to select sources with these flags set to True

Yes

Is this a change from previous behavior for detect_isPrimary?

Good question, I see that wasn’t clear. No, the behavior of isPrimary has not changed, and overlapping sources have always been left out. The only thing that really changed is the ambiguity between two different types of isolated sources, resulting in us having to choose which set we used as “primary” sources.

I don’t think that overlapping sources have always been left out by isPrimary? That would imply that if you plotted a spatial map of sources with detect_isPrimary==True, you would see gaps between patches. Maybe I’m misunderstanding?

This has at least been true since I started looking at isPrimary about 5-6 months ago, and definitely was not a change introduced as part of DM-28542. It is the reason that @laurenam’s code in pipe_analysis does not use isPrimary, causing her to create her own flags. See pipe_tasks before DM-28542.

The definition of tract and patch in skymap seems to be clear that every point in the sky belongs to the inner part of exactly one tract or patch, never zero or more than one. A given point may also belong to the overlap region of one or more tracts/patches.

This is similar to Qserv’s internal “chunk” concept, where each point in the sky belongs to exactly one chunk and also zero or more overlap regions.

Ah, thanks, K-T. That makes sense (and jibes with my experience). It could cause a lot of headaches if one was unable to easily extract sources/objects over a contiguous region.

Indeed. A rough definition of the flag is:

detect_isPrimary: the absolute minimum science users need to get a viable set of objects with no duplicates and no sky objects.

To satisfy the “no duplicates” criterion, in the overlap regions, only the sources in a given patch/tract’s inner bounding box are included. There is an inherent assumption that one is considering a large dataset consisting of overlapping (and contiguous) patches and tracts. This is actually not the case in pipe_analysis, which works at the tract level. Since I don’t want to omit the outer bounding box of the tract altogether, rather than use isPrimary, I select on

detect_isPatchInner

(along with the other relevant flags to get rid of parents and sky objects),
but not

detect_isTractInner

Thanks for the clarifications KT and Lauren, and thanks for noticing the discrepancy Jeff. I updated the original post accordingly.

I’m trying to use these flags to develop metrics about the deblending process itself, and I’m a bit confused by the distinction between _nPeaks and _nChild. From the definitions, it looks like the latter reflects the actual deblending process, but the former tends to be used in derived flags (e.g., detect_isIsolated, detect_isDeblendedSource). Can you explain why _nPeaks is the correct way of counting sources that have or have not been deblended?

Good question. There was some debate about this and IIRC the reason that we settled on using nPeaks as opposed to nChild is that nPeaks tells us the number of peaks that the detection algorithm believes are in the footprint. So while the deblender might disagree, for example it might fail to deblend one of the two peaks or assign no flux to one of the peaks, I wouldn’t call a source truly “Isolated” unless there is no evidence that there might be another source in its footprint. But I would call a detected peak evidence that it might not be isolated.

This does not happen often (I can’t quantify but it’s < 1%, maybe <<1%), but it can happen and I wouldn’t want to pass through a source as isolated when there might be another source that just wasn’t modeled well by the deblender.

I agree for “isolated”, but what about detect_isDeblendedSource, where it’s used to implement the condition “has no children of its own”? I’m having trouble applying the same logic there.

In order to include all of the sources once (and only once) each source has to either be classified as isolated or blended. So if a source isn’t isolated then it must come from a parent with multiple peaks.

The “has no children of its own” is future looking, as SourceCatalog was designed to be hierarchical and there’s a good chance that in the near future there will be deblended sources that also have children (multiple hierarchies of deblending).

Is this list of columns still current? In a recent AP run, the source catalog src had a deblend_parentNPeaks column but no deblend_parentNChild. On the other hand, it had both deblend_nPeaks and deblend_nChild.