Notes from DRP High-Level Data Flow Discussion

Meeting for RFC-152 was held 12pm Pacific 2016-03-02. Attendees: @jbosch, @RHL, @ivezic, @ktl, @timj, @hsinfang, @jdswinbank

The following is @jbosch’s original agenda, annotated with @timj’s notes, after further adjustments by @jbosch.

AGENDA

  • Preview Big Questions

  • Go through abstract Preprocessing Stages quickly (not the focus of this meeting)

  • Go through abstract Pipeline Stages, thinking about Big Questions

  • Revisit Big Questions

  • Go through the proposal for concrete pipeline stages on DM-4839

BIG QUESTIONS

  • What have I missed (or incorrectly assumed)?

  • What do we need to build the PSF star catalog, and should we reuse the one from the last DR?

  • Do we need to do any chromatic-aware processing in Image Characterization? If so, how/when do we revisit it once we have colors?

  • How do we build artifact masks for coaddition, and do we do that before or after we build templates for image subtraction?

  • What kinds of coadds do we build?

  • Is there any way we can remove the big MOPS sequence point?

PREPROCESSING STAGES

(do these before processing the main wide survey)

Calibration Products Pipeline

  • Generate typical ISR inputs.
  • Generate brighter-fatter kernels.
  • Generate wavelength-dependent photometric corrections.

PSF/Wavefront Characterization

  • Use dedicated observations taken at various focus levels to constrain relationship between wavefront sensor measurements, telescope telemetry information, and star images. I think of this as computing some sort of basis function set for PSF modeling in wavefront space, but I’m waving my hands.

Sensor Distortion Characterization

  • Use a set of many dithered observations (probably on a moderately dense field) together with flat fields to constrain frozen-in-sensor coordinate system effects (tree rings, pixel area variations). Could use the full survey dataset for this, but probably don’t need to. Could just be data from deep-drilling fields.

Deep/External Processing for Bayesian Priors

  • Process deep-drilling fields (and possibly joint-process with other surveys in overlap regions) first, so we can use measurements (e.g. distribution of galaxy ellipticities) as a prior on the wide survey.
  • Processing otherwise uses the same procedure as the main survey, but we need to figure out how to bootstrap priors (probably just using less informative priors in deep fields).

PIPELINE STAGES

Image Characterization

  • Includes ISR, snap-combine, first-pass backgrounds, single-frame astrometric and photometric calibration.
  • Includes some PSF estimation, but possibly (probably?) not the final one.
  • Does not necessarily include detect/deblending/measurement for final Source table, if that requires better PSF, background, or calibration.
  • No chromaticity here; if we need e.g. chromatic PSFs at this stage, we need to find a way to get colors at this stage.
    • @rhl is concerned this might end up being necessary.
  • Maximum sequence point scale is a single visit, but most work only needs a single CCD. Probably scatter-gather between visits and CCDs (but maybe multiple cores working on a single CCD, too).
    • @rhl: don’t assume that we can approximate this as CCD-by-CCD processing; full-visit stuff is very important.

Joint Calibration

  • Fit for best astrometric and photometric calibration for all exposures overlapping a region of sky (area of a handful of focal planes?)
  • Lets us constrain spatial variations on smaller scale than reference catalog allows. Expect Gaia to constrain larger scales.
  • First opportunity to estimate colors for objects; if we don’t use a PSF star catalog from a previous DR, this is our first opportunity to construct one.
    • @rhl is concerned that the PSF star catalog we construct at this stage might not be good enough; would be better to build one using classifications from a better PSF.
  • Sequence point for all visits overlapping a region of sky. Most work is probably multithreaded linear algebra, but I/O could also be a bottleneck if we don’t parallelize it intelligently.

Final PSF Estimation

  • Needs PSF star catalog for better S/G separation, colors to use for chromatic PSF determination.
  • Maximum sequence point is probably a single visit (some thoughts about using galaxy difference images, but that’s probably not a baseline kind of idea). Most work is probably on a visit scale, need to think about how to parallelize effectively within that.
    • @rhl is not convinced this guarantees a correct split between PSF and WCS; will discuss off-line with @jbosch. Iteration between PSF estimation and Joint Calibration would address this, and we might need that to improve the PSF star catalog anyway.

Coaddition

  • Lots of kinds of coaddition, may not want to do all of them.
    • Direct coadds with discontinuous lazy PSF - lossy, but less lossy than PSF-matched. Can’t do outlier rejection. Can update PSF model without updating the coadd itself.
      • @ivezic: why do we not use “stackfit” terminology? Common jargon.
      • @rhl: This is not the full stackfit implementation and only uses one idea from stackfit paper. Another term we use is CoaddPsf.
    • PSF-matched coadds - most lossy, but with simple, continuous PSF. Needed for aperture and moments measurements to be consistent across the sky and across bands. Can do outlier rejection.
    • Matched-filter coadds - not lossy, but only directly usable for detection and point source measurement. Can’t do outlier rejection.
    • Decorrelated coadds - optimal, but unproven with lots of algorithm development needed, with continuous but complex PSF; need some new tradeoff to deal with computational complexity or (in Kaiser coadd limit) unrealistic assumptions. Can’t do outlier rejection.
    • Decorrelated and PSF-matched - much less lossy PSF-matched coadd, derived from decorrelated coadds. Unproven, lots of algorithm development needed. Can’t do outlier rejection.
  • All coadds but PSF-matched require artifact masks derived from some sort of difference imaging (maybe background matching).
  • Probably want coadds across bands for detection matched to more SEDs (chi^2 is just a special case for sky SED?). Need to decide whether to actually use different PSFs for different hypothesis SEDs; will need to do something like this for diffim templates, at least.
    • @rhl: I’m not as worried about this as I used to be; slightly lower thresholds could do this more easily.
    • @ivezic: Did some investigation of (chi^2 coadds?) across all bands; found them nearly optimal.
  • Also need coadds for different epoch ranges, to detect faint slowly varying or moving objects.
  • Sequence point for all CCD images (possibly visit images) that overlap a patch of sky (smaller than region needed for Joint Calibration). Most processing on visit scale before gathering epochs.
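
The chi^2-style multi-band detection image mentioned above can be sketched in a few lines. This is a toy illustration, not pipeline code; the function and array names are made up, and per-band images are assumed to be already warped to a common grid:

```python
import numpy as np

def chi2_coadd(images, variances):
    """Combine per-band images into a chi^2-style detection image.

    Each pixel of the result is the sum over bands of (pixel/sigma)^2,
    so it is large wherever any band (or combination of bands) has
    significant flux, regardless of the source SED.
    """
    chi2 = np.zeros_like(images[0], dtype=float)
    for im, var in zip(images, variances):
        chi2 += im ** 2 / var
    return chi2

# Toy example: a point source visible in two of three bands still
# dominates the combined detection image.
rng = np.random.default_rng(42)
images = [rng.normal(0.0, 1.0, (64, 64)) for _ in range(3)]
variances = [np.ones((64, 64)) for _ in range(3)]
images[0][32, 32] += 20.0
images[1][32, 32] += 15.0
chi2 = chi2_coadd(images, variances)
```

Detecting with per-SED matched PSFs would replace the uniform quadrature sum with SED-weighted combinations; the chi^2 image is the special case that is agnostic to SED.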

Background Matching

  • Estimate (N-1) backgrounds on N exposures that overlap a patch of sky.
  • Probably should happen during coaddition (many shared operations).
    • @ktl: In the past we thought background matching was different from coaddition, because it needed larger spatial scales.
    • @jbosch: Relationship is unclear; needs experimentation. Could imagine warping all images first, then background matching then coadding.
  • Could be used to find some artifacts, since we produce difference images (but after warping all science images to coadd coordinate system, and with weaker requirements on matching kernel quality). Not clear if we can use this to find all artifacts (ideally want to find them before warping).
  • Probably a sequence point for all visit images that overlap a patch of sky (and this is why coadd production probably needs full visits rather than CCDs).
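
A minimal sketch of the (N-1)-offsets idea: the exposures are assumed to be already warped to a common grid, and a robust constant offset stands in for a real smooth 2-D background model. All names are illustrative:

```python
import numpy as np

def match_backgrounds(warped, ref_index=0):
    """Estimate per-exposure background offsets for N exposures of
    the same patch; only N-1 offsets are free, since the reference
    exposure defines the zero point.

    A real implementation would fit a smooth 2-D model to each
    difference image; the median of the difference is the simplest
    robust stand-in.
    """
    ref = warped[ref_index]
    return [0.0 if i == ref_index else float(np.median(im - ref))
            for i, im in enumerate(warped)]

# Toy example: three "exposures" of the same sky with different
# background levels.
rng = np.random.default_rng(1)
sky = rng.normal(100.0, 5.0, (32, 32))
warped = [sky, sky + 3.0, sky - 2.0]
offsets = match_backgrounds(warped)
print(offsets)  # close to [0.0, 3.0, -2.0]
```

The difference images formed as a by-product here are exactly the ones the notes suggest reusing for artifact finding, with the caveat that they exist only after warping.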

Source Measurement

  • If this needs final PSF estimation, calibrations, and/or backgrounds, it needs to happen after those are produced.
  • Maximum sequence point is a single CCD image.
  • @jbosch: Not sure what science requirements are.
  • @rhl: Need to keep it until we demonstrate multifit can, for example, do astrometry
  • @jbosch: Might be significant savings for database system if we can drop it
  • @ktl: Definitely.
  • @rhl: How do we get colors to this stage to do chromatic processing? Problematic.

Image Subtraction

  • Needs template coadd as an input.
  • Probably needed to generate artifact masks for coaddition.
    • @jbosch: Could find CRs in snaps, but other sharp artifacts could be hard to find in background-matching, which is post warping.
    • @rhl: Don’t always assume we will have snaps
    • @ktl: Would we get rid of snaps without having alternate CR rejection system?
    • @rhl: We have template images in DRP.
    • @jbosch: Might be harder for alert processing.
    • @rhl: Read noise issues in u-band may be an issue; could drop snaps to improve SNR
  • Unlike Nightly, could use PSF models as prior for kernels or do full-visit kernel modeling some other way.
  • Maximum sequence point is probably individual visits; maybe CCDs if we just use per-CCD matching kernels.
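
The per-CCD matching-kernel case reduces to a linear least-squares problem. A toy sketch using a delta-function kernel basis (real codes, e.g. Alard-Lupton-style fitters, use smooth basis functions and spatially varying kernels; all names here are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def fit_matching_kernel(template, science, k=5):
    """Fit a k x k convolution kernel K minimizing
    || template (*) K - science ||^2 over the image interior:
    the core linear step of PSF matching for image subtraction.
    """
    # Each interior science pixel is a linear combination of the
    # k*k template pixels under it: build the design matrix.
    windows = sliding_window_view(template, (k, k))
    A = windows.reshape(-1, k * k)
    pad = k // 2
    b = science[pad:-pad, pad:-pad].ravel()
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Flip to convert the correlation solution into a convolution kernel.
    return coeffs.reshape(k, k)[::-1, ::-1]

# Toy example: make a "science" image by convolving the template
# with a known kernel, then recover that kernel.
rng = np.random.default_rng(0)
template = rng.normal(size=(40, 40))
true_kernel = np.zeros((5, 5))
true_kernel[2, 2], true_kernel[2, 3] = 1.0, 0.5
science = np.zeros_like(template)
for u in range(5):
    for v in range(5):
        science += true_kernel[u, v] * np.roll(
            np.roll(template, u - 2, axis=0), v - 2, axis=1)
fit = fit_matching_kernel(template, science)
```

Using PSF models as a prior, as suggested above, would amount to constraining or regularizing this fit rather than solving it from pixels alone.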

Coadd Detection

  • Find sources on a suite of coadds - generates Footprints with Peaks, not deblended sources.
  • Probably runs on Matched-Filter coadds.
  • Sequence point is just a suite of matched-filter coadds for different SEDs and epoch ranges in a patch.
  • @ktl: Please clarify which of these steps depend on explicit ordering.
  • @jbosch: That’s in the concrete proposal; this is just a weak ordering. But coadd detection does not depend on image subtraction any more than coadds themselves do.

MOPS

  • Associate fast-moving object detections from Image Subtraction with existing solar system objects.
  • FULL SURVEY SEQUENCE POINT (!?)
    • @jbosch: Only sequence point requiring full data set. Is that a problem?
    • @ktl: Used to be another one for global calibration.
    • @rhl: Gaia will hopefully provide best calibration on larger scale, so we think we only need to do calibration on smaller scales (~40 deg^2): see Joint Calibration.
  • Unless we really want to start each DR from scratch, will want to update an existing solar system database - possibly tricky since all of the images in the DR will have already contributed to the nightly MOPS database using different processing.
    • @jbosch: Using the existing database as a starting point for an iterative procedure is probably fine, if we don’t use it to inform the final result. Otherwise statistical double-counting.
    • @ktl: Big worry is asteroids that disappear in DRP but were in L1 nightly
    • @timj: What is threshold for acceptance at MPC?
    • @rhl: Probably being too squeamish. Guess is this is not a real problem

Association and Deblending

  • Associate and merge peaks from coadd detection, image subtraction, and MOPS to form Object Candidates.
    • @jbosch: I think deblending needs to know whether the field has candidate asteroids blended in; faint asteroids may need their orbits. Hence the dependency on MOPS. Is that right?
    • @rhl: SDSS experience is that known asteroids will need to be included.
  • Reject redundant Candidates and deblend into Objects using suite of coadds.
  • Might be able to split this into a pure catalog association operation followed by image processing, but I’m not sure there’s value to that.
  • Sequence point is over a patch of sky, but with a potentially large suite of coadds. Need to find a way to parallelize within that scale (pixels? object families?)

Coadd Measurement

  • Run measurement algorithms on coadds, using deblended pixels. Probably need a suite of coadds (no larger than suite needed for deblending, possibly smaller).
  • Sequence point could be a patch of sky with a suite of coadds. Could also parallelize independently over object families, but determining the extent of those families prior to running measurement is problematic.
  • @jbosch: Need to fit across bands.
  • @rhl: Worried about large blends. Can you ever define “isolated” if you go deep? Object families could be larger than 4k x 4k; the Galactic centre would be too large.

MultiFit

  • Fit models to individual exposures using coadd measurements as starting point. Should be able to use coadd measurements to put hard bounds on region of each exposure needed for fitting a particular object.
  • Ideally parallelize over object families, maybe over pixels within that.
  • Probably need all bands in memory simultaneously.
  • May need some sort of divide-and-conquer for large blends.
  • Need clever algorithms for iterating over objects in a way that’s I/O and CPU-throughput friendly (especially given large dynamic range in object family size).
    • @timj: Does multifit handle proper motion fitting?
    • @jbosch: Astrometry community doesn’t necessarily believe we’ll pull it off, but we’re either braver or more foolish. MultiFit will fit at least PM and Galaxy shear.
    • @ivezic: A simple demonstration would be very useful.
    • @rhl: Need a real dataset but may not have enough of a sample yet. Maybe Phosim can generate reasonably “bad” data.

Forced Photometry

  • Measure fluxes at the position of every detection in every exposure, holding everything but amplitudes fixed.
  • No obvious reason this couldn’t be done during MultiFit, but it could also be done by parallelizing over CCDs and asking the DRP DB for the reference catalog that overlaps each one.
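
The amplitude-only fit has a closed form; a toy sketch with illustrative names and a Gaussian stand-in for the PSF:

```python
import numpy as np

def forced_flux(image, variance, psf):
    """Amplitude-only ("forced") flux measurement.

    With position and model shape held fixed (here: a PSF
    realization at the known object position), the best-fit
    amplitude has a closed weighted-least-squares form:
        F = sum(P * I / V) / sum(P**2 / V)
    """
    w = psf / variance
    return float(np.sum(w * image) / np.sum(w * psf))

# Toy example: an image that is exactly 5x the PSF, so the
# recovered flux is 5.
y, x = np.mgrid[-7:8, -7:8]
psf = np.exp(-(x ** 2 + y ** 2) / (2.0 * 2.0 ** 2))
psf /= psf.sum()
image = 5.0 * psf
variance = np.ones_like(image)
flux = forced_flux(image, variance, psf)
print(flux)  # close to 5.0
```

Because each measurement touches only a small pixel stamp and a fixed reference position, it parallelizes trivially over CCDs once the reference catalog lookup is solved.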

Missing Stuff

  • @rhl: Bright star masks
  • @ktl: Building epoch-based coadds is mentioned in DPDD. Do these have overlaps in time? In which case you can’t coadd the epoch ones into the full range.

Return to Big Questions

  • Can we reuse PSF star catalog from previous DR?
    • @rhl: Reusing it is unquestionably the wrong thing to do, but maybe it doesn’t really matter.
    • @jbosch: I like the simplicity (in e.g. provenance) of not reusing it. I think we need multiple PSF estimation stages anyway, so no cost to building it from scratch for each DR.
    • @rhl: Should start with at least Gaia (and WFIRST, etc) catalog, even if we don’t reuse our own DRs.
    • @jbosch: Definitely.

CONCRETE PROPOSAL

Reviewing https://github.com/lsst-dm/drp-docs/blob/master/tasks-top.yaml

General Principles:

  • Current proposal tries to minimize large-scale iterations if we think we can get away without them.
  • Should switch to preference for adding any large-scale iterations we think might be necessary; can triage later.

Specific Recommendations:

  • Iterate between Joint Calibration and Final PSF Estimation.
  • Iterate between coaddition and image subtraction to generate better templates.
    • @ktl: By “iterate” do we mean “iterate to convergence” or “do twice”?
    • @rhl: As long as we’re not assuming no iteration, it’s fine to assume it’s a fixed N, with N small (probably just 2).

Closing Discussion

  • @ktl: What more is left for the working group on DRP?
    • @jbosch: Not much. Need to synchronize with DPDD, LDM-151 needs a lot of work.
    • @ktl: Algorithm specification needs work though.
    • @jbosch: Can probably go two levels deeper before needing to experiment. Can at least enumerate the options.
  • @jbosch: I’ve been using YAML, because I want to think of it as a flowchart without doing the formatting. Is this a good idea? Is EA better?
    • @ktl: Some advantages, but could be hard to extract from EA once it’s in. Would help with relation to ops planning but probably not relevant. No compelling reason at this time.
    • @timj: Graphviz? OmniGraffle Pro can read .dot files.
    • @jbosch: Will continue with YAML, will hope someone writes a graphviz script for figures later.
  • @ktl: Where do other types of exposures fit in? Twilights, different exposure time? Deep-drilling?
    • @rhl: Coadding different exposure times complicates things. Otherwise processing is similar.
  • @ktl: Any new requirements for other subsystems?
    • @rhl: What about code to process WFS data?
      • @ktl: We get the code from T&S.
      • @rhl: Some concern over interaction with WFS team.
      • @ivezic: Should bring this up at PST. Also discuss issues of cross subsystem code review.
    • @rhl: Calibration Products Pipeline may have an impact on this work. Where do we first need chromatic information?
  • @ivezic: Multifit was main driver for sizing model. Are all these other iterations minor in comparison?
    • @ktl: Generally true. Open-ended iteration is a problem.
    • @rhl: Confident we won’t have that problem. I’d worry about the deblender scaling with density of objects. Multifit running simultaneously on blends is possible.
    • @jbosch: More exotic coadds and difference imaging may also drive sizing model.
  • @jbosch: Do I next update design documents or drill down on detail?
    • @ktl: Narrative text is easier to comment on.
    • @jbosch: LDM-151 is a possibility but would make it a larger document.
    • @ktl: Rewriting it in whatever way is required is fine. Can be multiple documents.
    • @jbosch: Makes sense to have discrete documents for AP, DRP, CPP (calibration) and SDQA, with LDM-151 as an overview paper.
    • @ktl: Use LaTeX or reST.
    • @jbosch: Need to sort out bibliographies in RST first.

Action Items

  • @jbosch will update concrete proposal as recommended.
  • @jbosch will then turn the yaml into a DRP-specific text document; summary of this will be one component of new LDM-151.

Are these the corrections that come from the auxiliary telescope?

The auxiliary telescope contributes, but there’s a lot more that goes into it (lab measurements, calibration frames, maybe data from a small full-sky camera, and possibly some things I’m not even aware of).

Just a side note that I’m very keen to build a BibTeX equivalent for reST/Sphinx. It would integrate with the arXiv and ADS bibliography databases (e.g. :cite:`arXiv:1603.00473` and :citep:`2014AJ....147..109S`, but also allow aliases to arXiv and ADS bibcodes). Schedule-wise, I don’t anticipate this happening in the current 3-month cycle since we’re all-hands on QA at SQuaRE, but we can prioritize it for the cycle after.

Good to hear. At this point, I’m less concerned about getting support for this soon, and more concerned about future development hitting some sort of wall where we determine we just can’t do something important that latex/bibtex can. It sounds like you have some ideas of how you want to do this already; are there any serious limitations you can anticipate now?

Laying out the growth prospects of the reStructuredText-based doc platform we’re building is something that can deserve its own thread (in fact I’m actually writing up a technote/proto design doc on our documentation and communications strategy) but here are some thoughts on LaTeX versus reStructuredText:

  • LaTeX will always be more typographically optimized on a printed, 8.5x11, page than any web site printed to PDF will be. We intend to replace the visual design we ship with technotes and design documents and work heavily on print styles, but the fact is that Knuth did a good job.
  • LaTeX, from a SQuaRE point of view, is self-service (at this point). We don’t have the resources at the moment to work on LaTeX-based build/delivery. We especially don’t have the talent to work on LaTeX style files. The reStructuredText platform, on the other hand, benefits from our institutional knowledge of Python.
  • Math is important, and you can review Sphinx’s math support here: http://www.sphinx-doc.org/en/stable/ext/math.html. As you can see, Sphinx’s math is essentially LaTeX’s, even including the availability of AMS-LaTeX. Let me know if there’s something missing here.
  • SQuaRE is focused on the reStructuredText platform. Development effort here improves the entire technote / design document / software documentation / data release documentation (tbd) ecosystem. Bibliographic support is an example of something we can add to the ecosystem, among other extensions.
  • If you want to turn your YAML into a live d3.js visualization, we can do that in a reStructuredText→web workflow.

Obviously I’m bullish on the prospects of the reStructuredText-based platform. The main limitation of the platform, at present, is the rate at which I can develop it (though the benefit of working within the open source Python ecosystem cannot be overstated).