Presentation of coadded image data to users

We have been working on refining the design of the SUIT (Science User Interface and Tools), and this is leading to some questions that feed back into the definition of the released datasets and their processing.

For access to single-epoch image data there is a great deal of prior art that helps us understand how users may wish to locate and view it. It is reasonable to view the single-epoch data as a set of images (resolvable at FPA, raft, CCD, or even amplifier scales), taken at well-defined times and pointings, for which catalogs of image metadata may be queried, and which can then be displayed or used to source cutouts for limited regions in sky or detector space. This logic applies equally to raw snaps, calibrated visits, and difference images.

In early drafts of the SUIT requirements the coadded image data was also tacitly treated in the same way, as if it were composed of a set of “images”.

Upon reflection, it’s not clear that this is as useful a model for access to the coadded data. An argument follows:

In the absence of concerns about spherical geometry, non-uniform pixels, and the like, the notional ideal of a coadd is essentially a single all-sky image (of which we’ll have a few, e.g., per-band deep coadds, template coadds, etc.). In this ideal, a user would typically not be interested in locating “an image”, which would be an unwieldy all-sky thing, but would really want to ask about the availability and properties of “image data” (a mass noun) in some selected region of sky. A catalog of image metadata, typically imagined with columns representing observation times and pointings, say, is not a very meaningful concept in this view.

Instead, a “what coadded data do you have near position X” query would produce a short (quasi-static) list of the coadds in the data release (an answer that doesn’t depend much on X, as long as X is part of the survey*), together with a set of metadata about the coadds which for the most part is not naturally per-image and tabular but is position-dependent - the stack depth, the achieved limiting magnitude, etc. - and itself image-like, representable as density plots or contour plots at a variety of (we assume user-selectable) scales. (Note that the DPDD appears to be silent on the existence of per-pixel stack-depth or limiting-magnitude maps, though I have not checked this obsessively; if so, this is a separate concern.)
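To make the contrast concrete, here is a minimal sketch of what such a position-based query might return; every name and type here is hypothetical (nothing below exists in the DM stack): a short, quasi-static list of coadd types, each carrying position-dependent, image-like metadata rather than rows in a per-image table.

```python
# Hypothetical sketch only: none of these names exist in the DM stack.
from dataclasses import dataclass
import numpy as np

@dataclass
class CoaddAvailability:
    coadd_type: str               # e.g. "deep", "template"
    band: str                     # e.g. "r"
    depth_map: np.ndarray         # per-pixel stack depth over the query region
    limiting_mag_map: np.ndarray  # per-pixel achieved limiting magnitude

def coadds_near(ra_deg: float, dec_deg: float, radius_deg: float) -> list[CoaddAvailability]:
    """Answer "what coadded data do you have near position X?".

    The list itself is quasi-static (it barely depends on the position,
    as long as it lies in the survey footprint); the metadata maps are
    the genuinely position-dependent part of the answer.
    """
    ...  # would be backed by per-pixel depth/limiting-magnitude maps, if produced
```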

In this ideal, the way that we compute coadds (by tiling the sky in some way to make the work, um, tractable) is effectively an implementation detail that a user should not initially need to care about. Of course, ultimately we will need to expose this information, if for no other reason than to support provenance inquiries, in which it becomes essential to think about units of coadded sky space that were computed by specific “jobs” (or an equivalent concept) in the DRP system. But even some obvious provenance queries, like “what single-epoch images went into this coadd”, don’t break down cleanly into a per-patch “image metadata” model: the answer to that question is position-dependent at the pixel level.
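A toy illustration of why that answer is pixel-dependent (the data model here is invented, not the stack’s actual provenance representation): each input exposure contributes only where its clipped, masked footprint lands on the coadd, so neighbouring coadd pixels can have different input lists.

```python
# Toy model: per-input boolean footprints in coadd pixel coordinates.
import numpy as np

def inputs_at(pixel_xy, input_footprints):
    """Return the IDs of the input exposures contributing at a coadd pixel.

    input_footprints: mapping from exposure ID to a boolean mask that is
    True where that exposure actually contributed to the coadd.
    """
    x, y = pixel_xy
    return [exp_id for exp_id, mask in input_footprints.items() if mask[y, x]]

# Two neighbouring pixels can have different answers:
fp = {
    1: np.array([[True,  True ], [False, False]]),
    2: np.array([[False, True ], [True,  True ]]),
}
assert inputs_at((0, 0), fp) == [1]
assert inputs_at((1, 0), fp) == [1, 2]
```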

From an idealistic standpoint, I think the “continuous image data” model is clearly preferable to always treating the coadded data as, effectively, a huge collection of patch-scale images. I think it’s also achievable, but it requires us to think some more about how to cope with the realities of the coaddition model we have chosen.

This is relevant not only to the design of the SUIT, but also to that of the cutout service provided by Data Access. The key issues arise at tract edges, where overlaps occur and geometrical distortion is at a maximum.

From a user’s perspective, when requesting cutouts around positions of interest, there is a natural desire to be able to issue a query that says no more than “from all-sky coadd type C in band B, give me a cutout of size S around location X” and that produces a single, unambiguous result. If we have an API that can frame that question, what does it do in overlap regions? Which of the overlapping tracts does it choose to source the cutout? Does it just punt on that question and always return multiple answers when any (or all? which?) of the cutout region is covered by more than one tract? Will we say that one of them is “preferred” / “authoritative”? Does the cutout service need to provide options to choose one or the other? Will it have options to return both “pure cutouts” - verbatim pixels from one or the other applicable tract - and full-resolution tangent-plane-reprojected cutouts?
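One way to make these choices explicit, sketched below with entirely hypothetical names, would be for the cutout API to take an overlap-handling policy rather than deciding silently:

```python
# Hypothetical API sketch; neither the names nor the options are settled.
from enum import Enum

class OverlapPolicy(Enum):
    PREFERRED = "preferred"      # single result: pixels from the "authoritative" tract
    ALL = "all"                  # one "pure cutout" per overlapping tract
    REPROJECTED = "reprojected"  # single tangent-plane-reprojected cutout

def get_cutout(coadd_type: str, band: str, ra_deg: float, dec_deg: float,
               size_arcsec: float, overlap: OverlapPolicy = OverlapPolicy.PREFERRED):
    """'From all-sky coadd type C in band B, give me a cutout of size S around X.'

    With overlap=PREFERRED this is single-valued everywhere, but only if
    the data release defines which tract is authoritative at each position.
    """
    ...
```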

When combining this with catalog queries, additional questions arise. A natural function of the SUIT will be “show me the (coadded) sky around LSST object Z”.** Objects in tract-overlap regions may have been detected on one or the other or (usually) both tracts, with duplicates removed and only one detection used to seed Multifit. With that in mind, the obvious default behavior of “show me the (coadded) sky around LSST object Z” should be “show me the vicinity of Z on the image that yielded the detection that was actually used” (though with an option to show the other, if any). For SDQA and Apps-development purposes, “show me both and let me drill down into how the duplicate was resolved / why it was seen on tract A and not tract B” would also be useful, of course. To support this functionality, the Object table must contain metadata that identifies the tract on which the actionable detection was made. The DPDD is currently silent on this.*** For a variety of reasons, this cannot be short-cut with a strategy like defining non-overlapping fiducial regions interior to tracts.
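For illustration only (the detect_tract column is invented, the DPDD defines no such field, and the per-tract cutout call is part of the hypothetical API sketched above), the default behavior would then be straightforward:

```python
# Hypothetical: assumes an Object-table column, here called "detect_tract",
# recording the tract on which the detection used to seed Multifit was made.
def show_sky_around_object(object_row, cutout_service, size_arcsec=30.0,
                           show_other_tracts=False):
    """Default to the tract that yielded the actionable detection."""
    tracts = [object_row["detect_tract"]]
    if show_other_tracts:
        # SDQA/Apps-development drill-down: also show the overlapping
        # tract(s), if any, on which a duplicate detection was removed.
        tracts += object_row.get("other_detect_tracts", [])
    return [cutout_service.get_cutout(tract=t,
                                      ra_deg=object_row["ra"],
                                      dec_deg=object_row["dec"],
                                      size_arcsec=size_arcsec)
            for t in tracts]
```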

(I have not mentioned patch-level overlaps. I gather that the intent is that these should have bit-for-bit identical pixel content, though this has not yet been achieved. If this works out, then patch overlaps can be made nearly invisible to the user. This is not possible for tracts, since the pixel distortions are different in the overlap regions of two - or more - tracts.)

Some of this was discussed in early 2015 in https://jira.lsstcorp.org/browse/DM-1916 and on the https://confluence.lsstcorp.org/display/DM/API page.


*: but see these concerns regarding DRP processing of deep drilling fields

**: as well as “show me cutouts from all the single-epoch calibrated image data around LSST object Z”; note that the latency of such queries will depend very strongly on the latency for availability of the single-epoch calibrated images themselves. @davidciardi has pointed out that this is likely to be a commonly requested function and that users will be unhappy with the large latencies that would arise from recomputing O(1000) calibrated visit images.

***: I am told that this might be encoded in the object ID. If so, the fact that the object ID has embedded fields with publicly useful semantic content should be documented.

Isn’t this implicit in the variance images (assuming we handle/propagate the covariances correctly)? See section 3.2 (or search for “mask”).

We are also planning to carry out some fake-source injection (cf. my thesis, or Peter Melchior’s Balrog), although I can’t find this in the DPDD; there is also concern that we don’t have enough compute to inject enough sources to characterise our data sufficiently well.
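For concreteness, the core of fake-source injection is adding PSF-convolved sources of known flux and re-running detection; here is a toy version, with a Gaussian stand-in for the PSF (a real implementation would use the measured per-position PSF model):

```python
# Toy fake-source injection: a circular-Gaussian PSF stand-in.
import numpy as np

def inject_fake_source(image, x0, y0, flux, psf_sigma):
    """Add a PSF-convolved point source of the given flux to `image` in place."""
    yy, xx = np.indices(image.shape)
    psf = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2 * psf_sigma ** 2))
    psf /= psf.sum()  # normalize so the injected source carries exactly `flux`
    image += flux * psf
    return image

# Completeness is then estimated by re-running detection and counting how
# many injected sources are recovered as a function of flux and position.
```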

In theory I agree, but since the processing (including, e.g., deblending) is envisaged as being done at the patch level (with overlaps), it may not work out so cleanly in practice.

See the discussion of patch/tract resolution in LDM-151 (section 5.4). It’s tricky, but I think it’s a DRP responsibility to provide an “official” resolution of the overlaps. For what it’s worth, SDSS had to do this for strips and stripes.

The pixels can be (modulo sky subtraction), but I don’t think it’s possible for the catalogues to be, because of deblending questions.

It is in the objectId, and there need to be mapper-level functions to split it out. Hmm, I didn’t see an issue for this; now there is one: https://jira.lsstcorp.org/browse/DM-6912
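To illustrate the “semantic content should be documented” point, here is a sketch of such a mapper-level unpacker; the bit layout below is invented, not the stack’s actual packing:

```python
# Illustrative only: hypothetical field widths for a packed objectId.
PATCH_BITS = 14      # width of the patch field
TRACT_BITS = 16      # width of the tract field
COUNTER_BITS = 34    # width of the per-patch running counter

def unpack_object_id(object_id: int) -> dict:
    """Split a packed objectId into its (hypothetical) component fields."""
    counter = object_id & ((1 << COUNTER_BITS) - 1)
    patch = (object_id >> COUNTER_BITS) & ((1 << PATCH_BITS) - 1)
    tract = (object_id >> (COUNTER_BITS + PATCH_BITS)) & ((1 << TRACT_BITS) - 1)
    return {"tract": tract, "patch": patch, "counter": counter}

# Round-trip check with a hand-packed ID:
packed = (42 << (COUNTER_BITS + PATCH_BITS)) | (7 << COUNTER_BITS) | 12345
assert unpack_object_id(packed) == {"tract": 42, "patch": 7, "counter": 12345}
```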

It’s certainly more than just variance images: our completeness is also a function of the PSF, of whatever masking may have happened at that point, and probably of blending (though I’m not sure how much it’s our job to characterize that), as well as of the properties of whatever we’re measuring the completeness of. Section 3.2 has a vague mention of the kind of thing we need to do, but it really needs to be expanded.

I know that both @jbosch (since he created it) and @gpdf (since he commented on it) are aware of it, but for the benefit of others I feel we should cross-reference DM-6366 here.

And on the “semantic content of opaque IDs should be documented” issue: DM-6913