Baseline Plan for Blended Fitting

One major discrepancy between the existing design documents and what I’ve been imagining (and what I’ve put in the DLP JIRA issues, and what I’ve been telling science collaboration members) is in how we’ll handle model-fitting for blended objects.

The baseline, at least as I’ve understood it, is to use an SDSS-style deblender to divide pixel values between neighboring objects, then fit them entirely separately. We’d fit both a moving point source model and a galaxy model to all objects.

Instead, I think we should use the SDSS-style deblender to initialize the model-fitting, and to handle non-model measurements (e.g. adaptive moments, aperture fluxes), but report our final model-fitting sampling results based on simultaneous fitting of neighboring objects. I don’t think we want to specify exactly what “neighboring” means, yet, but from the perspective of shear estimation (which is the main driver for sampling) I think the overwhelming opinion of pundits (from e.g. the DESC) is that simultaneous fitting is preferable to a pre-fitting deblender, because it lets us characterize the correlations between the shears of neighboring objects. Note that I’m not declaring that simultaneous fitting definitely will work better; additional model flexibility in a pre-fitting deblender will almost certainly make it better at doing some things, and that may include shear. I just think simultaneous fitting is a better baseline.

Simultaneous fitting of both moving point source models and galaxy models is tricky, though, because you don’t want to fit a complex blend with all possible combinations of models for each object, and you also don’t want to assume too much about what an object is. Instead, I think we need to come up with a hybrid model that transitions smoothly from a moving point source to a galaxy model at the point where the radius and the proper motion and parallax go to zero, and sample from that hybrid model. That will result in each object having some samples with each model, with the proportion determined according to the posterior probability. As with any computation we use that involves a prior, we’ll make sure it’s broad enough that people can use importance sampling to replace our prior with their own.
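The prior-replacement step can be illustrated with a minimal sketch of importance reweighting: each stored sample gets a weight equal to the ratio of the user's prior to the pipeline's prior. Everything below is a toy assumption (one parameter, Gaussian likelihood and priors, hypothetical function names); it is not a description of any existing LSST code.

```python
import math
import random

def gaussian_pdf(x, mean, sigma):
    """Normal density, used here for both toy priors."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def reweight(samples, old_prior, new_prior):
    """Normalized importance weights that swap old_prior for new_prior."""
    raw = [new_prior(s) / old_prior(s) for s in samples]
    total = sum(raw)
    return [w / total for w in raw]

def effective_sample_size(weights):
    """Kish effective sample size; a small value means the original
    prior was too narrow to support the replacement prior."""
    return 1.0 / sum(w * w for w in weights)

def pipeline_prior(x):
    """Deliberately broad prior assumed for the stored samples."""
    return gaussian_pdf(x, 0.0, 10.0)

def user_prior(x):
    """Narrower prior a user might prefer (illustrative only)."""
    return gaussian_pdf(x, 0.0, 0.5)

# Toy likelihood: Gaussian centred on 1.0 with width 0.5. With Gaussian
# priors the posterior is known in closed form, so the "pipeline" samples
# can be drawn from it directly.
rng = random.Random(42)
precision = 1.0 / 0.5**2 + 1.0 / 10.0**2
post_mean = (1.0 / 0.5**2) / precision
post_sigma = precision ** -0.5
samples = [rng.gauss(post_mean, post_sigma) for _ in range(20000)]

weights = reweight(samples, pipeline_prior, user_prior)
user_mean = sum(w * s for w, s in zip(weights, samples))
ess = effective_sample_size(weights)
```

For this toy setup the exact posterior mean under the user's prior is 0.5, so `user_mean` should land close to that value. The effective sample size is the quantity to watch: it collapses if the stored samples do not cover the region the new prior favours, which is why the original prior needs to be broad.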

This simultaneous-fitting / hybrid model approach is what I’ve put into the plan in the JIRA DLP issues, and my preference would be to modify the DM design documents to match it, but I’ve started the discussion here instead of just creating a PR on the design documents because I wanted to get some feedback on it first.

I’m also not sure if the DPDD needs to be modified; it’s sufficiently vague on this subject that I think it’s not inconsistent with either plan, but these algorithm choices certainly aren’t just implementation details - they very much impact the delivered data products. I’d like some guidance from @mjuric on whether I should create a PR to flesh out the plan there as well, or whether we want to keep the DPDD at more of a summary level.

Haven’t we always discussed doing this? At the very least we need to implement this to compare the results with the baseline.

There are hybrid approaches, of course. For example

  1. perform an SDSS-style deblend using heuristic templates
  2. Fit models to all the data, seeded from the SDSS images, resulting in some measurements
  3. Use those models as templates to redeblend
  4. Make some more measurements
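A minimal sketch of the pixel-apportioning step that both deblending passes above would share, assuming a 1D image and ready-made templates. The function name and the toy numbers are hypothetical, and the template amplitude fit that a real SDSS-style deblender performs first is omitted.

```python
def apportion_flux(image, templates):
    """Split each pixel's value among objects in proportion to their templates."""
    children = [[0.0] * len(image) for _ in templates]
    for i, value in enumerate(image):
        total = sum(t[i] for t in templates)
        if total <= 0.0:
            continue  # no template claims this pixel; leave it unassigned
        for child, t in zip(children, templates):
            child[i] = value * t[i] / total
    return children

# Toy blend of two overlapping objects (illustrative numbers only).
image = [2.0, 6.0, 10.0, 8.0, 9.0, 5.0, 1.0]
heuristic_templates = [
    [1.0, 3.0, 5.0, 3.0, 1.0, 0.0, 0.0],  # object A
    [0.0, 0.0, 1.0, 2.0, 4.0, 2.0, 1.0],  # object B
]
children = apportion_flux(image, heuristic_templates)
```

In the second pass (step 3), the same function would be called with templates rendered from the fitted models instead of `heuristic_templates`. Flux is conserved wherever at least one template is non-zero.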

Yeah, we’ve discussed a lot of these options a lot. But for the design documents my understanding is that we need to pick one to be the baseline (and hence be what observers who don’t have time to understand our plans in detail expect), even if we plan to explore multiple options. And my understanding, at least back when we were preparing for the CDP review, was that pre-measurement deblending was understood to be the baseline.

I’d like the design documents to reflect reality. If we are trying one approach first, then trying another if that was insufficient, we should write that. If we are trying multiple approaches (in parallel or in series) and picking the one that works best, we should write that. But we should not be planning for someone to work on optimizing something without a clear criterion for stopping (time or performance or both) or without some idea of the deliverables along the way.

LDM-151 section 5.4.2.11 4th paragraph reads to me like @RHL’s hybrid approach.

Jim, does your simultaneous fitting scheme also include multiple point sources and work in crowded fields?

I’d like it to, with the caveat that the “what’s a neighbor” problem is really tricky there; we need to transition somehow from simultaneous fitting to iterative fitting when the blended area gets extremely large (i.e. each star would be fit simultaneously with its neighbors but iteratively with objects further away). I have some vague ideas on how to do that, but nothing remotely solid.

But I also recall @mjuric or @ktl claiming at some point that we were counting on being able to save on compute by not doing the Monte Carlo sampling at all in the plane or the bulge, so it’s possible what I’d like to do there is incompatible with the compute budget. I also could imagine a scheme where we just modify the prior to appropriately prefer stars to galaxies in those regions, which would cut down dramatically on the number of (more expensive) galaxy model likelihood evaluations there and potentially make it fast enough to do the full processing.

LDM-138 CPU Requirements spreadsheet, Data Release tab, column P does indeed account for per-forced-source computations for all objects except “galactic plane stars”, which are instead accounted for as stackfit-like coadd measurements.

It’s long been my desire to write a deblender that smoothly goes over to a crowded field code like Daophot/Dophot/Wolf as the prior on being a star goes to one.

That’s actually even more limited than I thought, and it will get in the way of measuring proper motions or variability from stars fainter than the single-epoch detection limit (though I don’t know how much that was already unfeasible just given the information content of the images). I hope we can find the cycles to do it in the end, but it sounds like that needs to be considered a stretch goal at most.

I think “Galactic Plane Stars” should be interpreted as “Stars within XX degrees of the Galactic plane”, with XX ~ 5.

Okay, that’s great to hear!

And Magellanic Clouds and centers of globular clusters.

I understood we just didn’t do extended source fits (with all the glorious sampling) in the Galactic plane; all other measurements (esp. the moving point source model) would continue to be done.

PS: Note that sampling was only ever required for the extended source model (the assumption was that the cycles needed for everything else are negligible compared to the extended source model fits; I’m not sure if they’re even accounted for explicitly in the compute sizing model).

@jbosch: This definitely needs to be fleshed out, thanks for starting this discussion. I like the simultaneous fitting approach (mainly because it’s the obviously right thing to do (rah rah…)). The potentially scary parts are:

  • How does this affect computational and storage requirements? I suspect it’s OK for sampling, but what about maximum likelihood estimates (and covariances thereof)? [ Btw., I know the baseline approach has all these problems as well, just chooses to ignore them. ]
  • The hybrid model approach sounds reasonable – in effect, you’d be sampling a “moving extended source model” that reduces to a moving point source when a point source prior is assumed. That further increases the dimensionality of the problem, though. Will that be a problem?
  • Making everything self-consistent will be important. I’d be scared to have one set of deblends using heuristic templates, and in effect another using the results of the fitting. An iterative approach that @RHL mentioned will be needed.
  • Is this the time to further think on deblending priors that may vary depending on Galactic latitude (or known star clusters, etc.)? Crowded field photometry?

This is all highly non-trivial; I think it would be good to begin writing a short paper (page or two) with a proposed approach (proposed title: “The approach to multi-epoch measurement of blended objects in LSST”?). Basically, expand on the text you have here, with references to relevant prior art and what others are doing, why the current baseline is insufficient, and a proposal on how to proceed. After some iteration, we’ll use that to modify both the DPDD and LDM-151.

We’ve always known deblending would be messy (w. a lot of trial and error); writing down what we feel is currently a promising approach would be a good start.

I focused on sampling here largely because I think simultaneous fitting won’t affect computational and storage requirements for that: it’s the same number of samples and model evaluations, and the only difference is the order of operations. It might affect memory usage and hence the parallelization model, but those are already very much up in the air. For the same reason I think in the new baseline I’d say we shouldn’t plan to do any simultaneous fitting for maximum likelihood; I think that would have to involve covariance matrices that would scale badly with increasingly large blends.
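The scaling concern for maximum-likelihood covariances can be made concrete with a short count of independent matrix entries. The figure of 10 parameters per object is purely an assumption for illustration.

```python
def covariance_entries(n_objects, params_per_object, simultaneous):
    """Independent entries in the symmetric covariance matrix (or matrices)."""
    if simultaneous:
        # One joint matrix over all parameters of all objects in the blend.
        dim = n_objects * params_per_object
        return dim * (dim + 1) // 2
    # One small matrix per object, with no cross-object terms.
    per_object = params_per_object * (params_per_object + 1) // 2
    return n_objects * per_object

separate = covariance_entries(50, 10, simultaneous=False)
joint = covariance_entries(50, 10, simultaneous=True)
```

The separate count grows linearly with the number of objects in a blend, while the joint count grows quadratically.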

I think we’ll have to bake the transition point into the sampler somehow, and I don’t have a solid idea for how to do that yet. But my intuition is that it won’t be that hard and it will make the increase in dimensionality a non-issue.

Agreed. This is also why I’d really like to avoid doing different things for different parts of the sky or for different kinds of objects. I do think we will have some measurements that have to be based on predeblending and some based on simultaneous fitting, but I don’t see that as a problem; we’ll just have to document it well.

I think this is a closely related algorithmic issue, and we may want to use the same priors in multiple places, but I think the deblending/fitting/measurement problem for crowded fields is actually much easier than the detection/how-many-objects and PSF estimation problems. I think the former is really just a matter of changing the prior on S/G classification and improving whatever divide-and-conquer approach we use for large blends in the high-latitude part of the survey. For the latter, I don’t see an obvious extension of our existing approaches that will work in crowded fields.

I’ll get started on that as a separate document in a branch of the LDM-151 repo for now. I figure that’s where most of the text will land in the end.

A first draft is now up for comments at https://github.com/lsst-dm/dm_applications_design/pull/3.

Jim, here are my comments on your “Measurement of Blended Objects in LSST” white paper. Sorry it took so long. Lots of other things have been keeping me busy.

I have some general concerns about the SDSS-style deblending of sources (I had to read RHL’s write-up on this to actually see how it works).

  • The measurements are not actually done on the “real” data but on the deblended images using imperfect “empirical” templates (it also seems quite circular since the final shape/flux measurements will depend highly on the original template shape). I think the best way is to do measurement on the real image using simultaneous forward modeling. I would rather have parameters of an (appropriately flexible) analytical function that was fit on real data (simultaneously with neighbors) than a measurement on a deblended image. The SDSS deblending already does a kind of simultaneous fitting anyway (of weighted templates to the data), why not do a “proper” simultaneous forward-model of all objects in the image?

  • The deblending is known to break down in crowded regions. The empirical templates assume that many pixels around the center of an object are dominated in flux by that object (or enough to make a decent template). Some tricks help deal with mild contamination (medianing, symmetric assumption), but once it becomes crowded enough these won’t work and the templates will become useless.

  • The deblending method can also miss objects that aren’t detectable as peaks in the image, for example a faint galaxy that is behind many foreground stars. An iterative detection, fitting, subtraction and redetection scheme is needed to find these fainter objects.

I personally prefer to have a solution that works for all situations and not two methods (one that works for case A and another that works for case B) that are transitioned between depending on the environment.

Some ideas:

  • I think an iterative approach with detection, fitting, subtraction, redetection, refitting (now with sum of detections), etc. is a good way to proceed. This is typical for crowded-field PSF photometry codes, but it would need to include galaxy models.

  • Maybe a handy “trick” or technique might be to think of the image as being composed of a field of foreground stars and a background of fainter galaxies. Then proceed to first fit the bright-ish peaky objects as point sources and subtract them (could leave in anything that has a sub-par PSF chi-squared), then do extra detection (and preliminary measurement) of fainter stars and galaxies in the subtracted image. Combine the source lists and fit all objects simultaneously with forward modeling on the image. Repeat until no more sources are detected in the subtracted image.
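The detect, fit, subtract, redetect loop described in these two ideas can be sketched on a noise-free 1D toy. The sketch assumes a known Gaussian PSF, point sources only, integer-pixel positions, and hypothetical function names; galaxy models and position refinement are left out.

```python
import math

def psf(x, center, sigma=1.5):
    """Unit-peak Gaussian PSF sampled at pixel x (an assumed toy PSF)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def solve(a, b):
    """Solve the small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_amplitudes(image, centers):
    """Simultaneous linear least-squares fit of one amplitude per source."""
    n = len(image)
    basis = [[psf(x, c) for x in range(n)] for c in centers]
    a = [[sum(p * q for p, q in zip(bi, bj)) for bj in basis] for bi in basis]
    b = [sum(p * d for p, d in zip(bi, image)) for bi in basis]
    return solve(a, b)

def model_image(n, centers, amps):
    """Render the current source list as an image."""
    return [sum(a * psf(x, c) for c, a in zip(centers, amps)) for x in range(n)]

def detect_fit_subtract(image, threshold, max_rounds=10):
    """Add the brightest residual peak, refit all sources together, repeat."""
    centers, amps = [], []
    for _ in range(max_rounds):
        model = model_image(len(image), centers, amps)
        residual = [d - m for d, m in zip(image, model)]
        peak = max(range(len(image)), key=lambda i: residual[i])
        if residual[peak] < threshold or peak in centers:
            break  # nothing new to detect in the subtracted image
        centers.append(peak)
        amps = fit_amplitudes(image, centers)
    return centers, amps

# A faint source hidden in the wing of a bright one: it is not a separate
# peak in the image and only shows up after the first subtraction.
truth = [(20, 100.0), (23, 20.0)]
image = [sum(a * psf(x, c) for c, a in truth) for x in range(40)]
centers, amps = detect_fit_subtract(image, threshold=5.0)
```

The first round overestimates the bright source because it absorbs part of the neighbour's flux. The simultaneous refit in the second round is what recovers both amplitudes.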

I think it would be a good idea to test various techniques before we settle on any baseline plan. Getting feedback from experts outside of LSST would also be very useful.

Other comments:

Section 4.1, Model Selection
Can you explain more how the hybrid model would work? Is it a smooth transition between a moving point-source and a galaxy? What’s the advantage of a hybrid model over two models (one moving point source, one galaxy)?

Near end of Section 8.1, Baseline
You discuss including variability in the “multi-epoch” mode (which I presume is multi-fit). It greatly surprised me that this is even an issue. How can you not include variability? An important fraction of stars are varying at a level that will be significant in the LSST data (i.e. if we do not include that variability in our models, then our chi-squareds, uncertainties, etc. will be wrong). You discuss this more in Section 8.2.4 (Variability in Multi-Epoch Modeling). And I’m surprised that you say in the last paragraph that this might be a Level 3 product! Photometric variability MUST be a product of Level 2, otherwise we lose the great temporal sensitivity that LSST provides us.

At the very end of Section 8.1, you discuss using the simultaneous-fits as templates to create deblended single-epoch images for forced-photometry. Again, I find it strange to create deblended images for making measurements. I personally think that the multi-epoch data should be fit simultaneously (all objects, all frames) with a forward modeling approach, using a moving point-source model and a flexible (analytical) galaxy model (as you presented in Bosch 2010). This is similar to how DAOPHOT ALLFRAME works, except DAOPHOT only has a non-moving point-source model.

Section 8.2.1, Likelihood Coadds
You mention that the Likelihood coadds cannot be interpreted in the same way as traditional coadds and require completely new algorithms. Why is that?

Section 8.2.6, Deblend Template Translation
I don’t understand the need for template translation in a forward modeling framework. The idea is that you model all of the individual epoch PSF (and other effects) so that you can “get at” the underlying distribution of flux. I don’t see why any translation is needed if the forward modeling is done correctly.

Somewhat relevant questions:

  • Is there a write-up of the multi-fit plans, coadd plans?

  • What’s the difference between multi-fit and forced-photometry?

Thanks for all the detailed comments. Replies below.

I agree, and we do in fact simultaneously forward-model groups of overlapping objects for what I consider our best measurements (P5, P6, & P7 in the flowchart). If we’re in a field so crowded that such a group extends to cover an entire image, then I do think we need to either simultaneously forward-model those or use an iterative procedure that amounts to the same thing.

But I think there’s still a role for an SDSS-style deblender to play:

  • Doing non-simultaneous model fitting to individual objects on deblended pixels first will let us get close to the best-fit parameters without having to minimize an extremely high-dimensional problem. In this role, the SDSS-style deblender is just an implementation detail that improves the performance (and possibly the robustness) of the simultaneous fitting.
  • We still have to make some “old-style” measurements that can’t be formulated well as model-fitting (or at least not in a way that would make simultaneous fitting sensible), such as aperture fluxes or Gaussian-weighted second moments. For these, I don’t think we have any choice but to use deblended pixels. I think it unlikely these measurements will outperform model-fitting results on faint sources (especially in crowded fields), but I expect they’ll still be useful for bright, moderately-isolated objects.
  • At present, I’m recommending that the baseline use deblended pixels for forced photometry measurements, but I think this is an area where we need to try several options before settling on any of them, and one of those options is simultaneous fitting with free amplitudes for every epoch. Another is doing forced photometry on difference images. For this task, choosing the deblended-pixels option as the baseline is a bit arbitrary; none of these options have been tested and all of them are problematic in different ways.

The key trick that will help in this area is replacing the template with the PSF model when appropriate, which (if done to all objects in a blend) does reduce the deblender to a true simultaneous fitter. The catch is the question of how to know when to do that. In the SDSS deblender, I believe this was done only when a derived template was found to be similar to the PSF model, and it may be that for LSST we’ll have to approach this from the opposite direction, and only use symmetry-ansatz templates when a PSF template is inadequate. Robert has been talking for a while about using the aggressiveness at which we choose PSF templates over symmetry-ansatz templates as a parameter we can vary smoothly as a function of (e.g.) galactic latitude, so when there’s insufficient information in the data we use PSF templates in crowded star fields and symmetric templates when we expect moderately-isolated galaxies. There are many details that need to be filled in, but I think something along those lines will work.
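A minimal 1D sketch of the two template choices discussed here. It assumes the symmetry-ansatz template is built by taking the minimum over pixel pairs mirrored about the peak. The `tolerance` knob stands in for the "aggressiveness" parameter and is a hypothetical name, as is the mismatch metric.

```python
import math

def gaussian(offset, sigma=1.5):
    """Unit-peak Gaussian, used as the toy PSF."""
    return math.exp(-0.5 * (offset / sigma) ** 2)

def symmetric_template(image, peak):
    """Template from the minimum of each pixel and its mirror about the peak."""
    n = len(image)
    template = [0.0] * n
    for i in range(n):
        mirror = 2 * peak - i
        if 0 <= mirror < n:
            template[i] = max(0.0, min(image[i], image[mirror]))
    return template

def choose_template(image, peak, psf_template, tolerance):
    """Prefer the PSF template when the symmetric one looks PSF-like.

    The mismatch is the summed absolute difference of the two normalized
    templates, so it lies between 0 (identical) and 2 (disjoint).
    """
    sym = symmetric_template(image, peak)
    sym_sum, psf_sum = sum(sym), sum(psf_template)
    mismatch = sum(abs(s / sym_sum - p / psf_sum) for s, p in zip(sym, psf_template))
    if mismatch <= tolerance:
        return "psf", psf_template
    return "symmetric", sym

# A star at pixel 10 with a fainter neighbour at pixel 14.
n = 21
image = [100.0 * gaussian(x - 10) + 30.0 * gaussian(x - 14) for x in range(n)]
psf_template = [gaussian(x - 10) for x in range(n)]
sym = symmetric_template(image, peak=10)
kind_strict, _ = choose_template(image, 10, psf_template, tolerance=0.0)
kind_loose, _ = choose_template(image, 10, psf_template, tolerance=2.0)
```

Raising `tolerance` as a function of, say, Galactic latitude would push the choice toward PSF templates in crowded star fields. The mirrored minimum suppresses the neighbour at pixel 14 because its mirror pixel contains only the primary's wing.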

I think you’re probably right, but I’d actually intended for detection (including iterative fitting and subtraction as part of detection) to be beyond the scope of this particular document (and, honestly, I hadn’t thought much about the need to do iterative fitting and subtraction there until you brought it up in Bremerton, so I’m very glad we had that talk). There’s obviously some wasted effort if we do fitting both during detection and later during measurement, but I very much like the idea of having everything detected and included when we do the final simultaneous fitting that we use for output parameters. I’m a bit worried that if we use the iterative fitting we do as part of detection to do actual measurements they’ll be biased significantly by the objects not detected in that round. That’s why this document starts from the assumption that all objects have already been detected.

Ideally, I’d prefer a method that doesn’t have to transition between modes as well, and making use of a prior that assumes a particular star vs. galaxy fraction as a function of position on the sky to make that transition obviously runs the risk of biasing the result in the direction of the prior. That said, I think we can produce algorithms that are better in almost every respect if we include that prior, because so much of what we do is in a poorly-resolved, low S/N regime where the likelihood doesn’t really tell us much. Put another way, the fact that nearly everything faint is a galaxy at high-latitude and nearly everything (period) is a star in the bulge is not information we can afford to ignore. The real key here is making it possible for users to “back out” our prior and substitute one of their own; we’ll have to do that in many other places in the pipeline (e.g. nominal SEDs used to correct for chromatic PSF effects when measuring photometry). I know how to do that for the sampling measurements (P7 in the flowchart), but I hope to come up with an approach that works for the minimizer-based fitting as well.

Leaving aside the question of priors and algorithm transitions, though, it sounds like we actually have similar ideas in mind for detection. I think we absolutely will need some amount of iterative fitting and subtraction, but it’s a research project to determine how much (and whether we gain anything by being explicitly more iterative in crowded fields). But I’d like to defer most of that discussion until I have a chance to write a similar document for detection and deblending.

This is an area where @ktl and @jdswinbank have been pushing for me to be more explicit and less expansive. It’s relatively easy for me to enumerate all of the options, and I’d like to just be able to implement them all and have a bake-off, but given finite budgets and schedules we also need to identify some areas where we’ll implement one option and only explore others if the first doesn’t “meet requirements”. In many cases, those requirements haven’t really been written down, and that’s the first thing that needs to be addressed. We’ll still have some build-it-all/bake-off design choices, but we may not have as many as we’d like.

Agreed. One of my biggest takeaways from the DESC meeting and @jsick’s post on the MW/LV meeting is that we need to get our design documents in order and at quite a high level of detail to enable the science collaborations to comment on them effectively. Ad-hoc conversations with DM team members about algorithm plans clearly aren’t communicating what we’re planning well enough (perhaps because our plans are too vague even in our own minds until we’re forced to write them down - I certainly discovered that this was the case in writing this document).

I was intentionally vague here, because I’m not sure exactly how it will work, but it is a smooth transition between a moving point source and a galaxy; both models share a point in parameter space where the motion parameters and the radius are zero, and from that point it can go in either direction.

The advantage is in avoiding combinatorial simultaneous fitting problems - if we’re fitting 3 objects simultaneously, we don’t want to try all 2^3 (star-star-star, star-star-galaxy, …) possibilities, and it’s clearly impossible for a group of 50 objects. Using a hybrid model lets us use one model for all objects, and hence do only one simultaneous fit, while still exploring the possibility that each object is a star or a galaxy in proportion to its likelihood.
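A toy 1D illustration of the shared point in parameter space and of the combinatorial count, assuming Gaussian profiles throughout. The parameterization and all names are hypothetical; how the sampler would restrict itself to one branch at a time is exactly the part that is still undecided.

```python
import math

PSF_SIGMA = 1.5  # assumed Gaussian PSF width in pixels

def hybrid_profile(x, t, flux, x0, radius=0.0, motion=0.0):
    """Gaussian source of size `radius`, drifting at `motion` pixels per unit
    time, convolved with a Gaussian PSF (widths add in quadrature)."""
    width = math.hypot(PSF_SIGMA, radius)
    center = x0 + motion * t
    z = (x - center) / width
    return flux * math.exp(-0.5 * z * z) / (width * math.sqrt(2.0 * math.pi))

def moving_star(x, t, flux, x0, motion):
    """Star branch: zero radius, free motion."""
    return hybrid_profile(x, t, flux, x0, radius=0.0, motion=motion)

def galaxy(x, t, flux, x0, radius):
    """Galaxy branch: zero motion, free radius."""
    return hybrid_profile(x, t, flux, x0, radius=radius, motion=0.0)

def n_discrete_fits(n_objects):
    """Number of star/galaxy assignments if each combination were fit separately."""
    return 2 ** n_objects

# At zero motion and zero radius the two branches are the same model.
shared_star = moving_star(3.0, t=2.0, flux=1.0, x0=3.0, motion=0.0)
shared_galaxy = galaxy(3.0, t=2.0, flux=1.0, x0=3.0, radius=0.0)
```

With one hybrid model per object, a blend of 50 objects needs a single simultaneous fit rather than `n_discrete_fits(50)` separate ones.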

I certainly never intended to imply that variability would be level 3; it’s absolutely level 2, and in the baseline plan that’s what the forced photometry is primarily (entirely?) for. The question is how best to measure it, and if we ignore for now the possibility of doing these measurements on difference images, the two remaining options both have some potential problems:

  • Including variability directly in the multi-epoch simultaneous fitting, as in 8.1, is the more principled approach, and the option with the higher ceiling. The problem is that it’s a huge increase in the degrees of freedom in a regime where the other parameters of most objects are already poorly constrained due to lack of S/N, and most of those objects will only be variable at a very low level. If we had a model for the actual lightcurves we could include that had fewer parameters than the number of epochs, we’d obviously avoid this problem, and we might be able to avoid it with a sufficiently informative prior on just the level of variability. Without that, I think there’s a very real danger of overfitting the noise.

  • The baseline plan, in which we fit simultaneously first without including variability, then use those models to deblend pixels and measure new fluxes for every epoch, obviously does the wrong thing in terms of getting the chi^2 values and uncertainties right at the first stage. But if the level of variability is low compared to the average flux, just treating variability as an extra source of noise (i.e. underfitting) initially may produce better lightcurves in the end.

Formally, the problem is that we don’t have good models and priors for the lightcurves themselves, and hence we’re stuck with either a too-flexible model (which asserts fluxes at different epochs are entirely unrelated) or a two-step process that incorrectly considers fluxes independently from positions and shapes. The experience from DAOPHOT ALLFRAME (that you don’t have obvious overfitting problems from including per-epoch amplitudes, at least with a smaller set of centroid/shape parameters) is a useful one.

In any case, this is definitely an area where I want to build both and bake-off, and I brought up level 3 just to motivate that point - even if we decide that what I’ve labeled as the baseline plan is better overall than including per-epoch amplitudes directly in multi-epoch fitting, we need to implement the latter anyway since users want to use it (presumably with more informative priors or more restrictive models) on subsets of the dataset in level 3.

Likelihood coadds aren’t images of the sky - each pixel is proportional to the log likelihood of there being a point source at the center of that pixel, instead of being proportional to the flux from the object at that position. That means that a PSF-weighted centroid, for instance, is just the point where the first spatial derivatives of the image (with some choice of interpolation) are zero.
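A minimal 1D sketch of this idea, under the assumption that the coadd is built as a sum of per-epoch matched-filter images (each epoch correlated with its own PSF and divided by its noise variance). The function names, the Gaussian PSFs, and the three-point parabola used as the "choice of interpolation" are all illustrative assumptions.

```python
import math

def gaussian_psf(offset, sigma):
    """Unit-peak Gaussian PSF evaluated at a pixel offset."""
    return math.exp(-0.5 * (offset / sigma) ** 2)

def matched_filter(image, sigma_psf, variance, half_width=10):
    """Correlate one epoch with its PSF and divide by the noise variance."""
    n = len(image)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(-half_width, half_width + 1):
            if 0 <= i + k < n:
                total += gaussian_psf(k, sigma_psf) * image[i + k]
        out.append(total / variance)
    return out

def likelihood_coadd(epochs):
    """Sum matched-filter images; epochs = [(image, sigma_psf, variance), ...]."""
    maps = [matched_filter(*epoch) for epoch in epochs]
    return [sum(values) for values in zip(*maps)]

def centroid(coadd):
    """Sub-pixel maximum from a parabola through the peak and its neighbours,
    i.e. the point where the interpolated first derivative is zero."""
    i = max(range(1, len(coadd) - 1), key=lambda j: coadd[j])
    left, mid, right = coadd[i - 1], coadd[i], coadd[i + 1]
    return i + 0.5 * (left - right) / (left - 2.0 * mid + right)

def make_epoch(center, flux, sigma_psf, n=41):
    """Noise-free image of a single point source."""
    return [flux * gaussian_psf(x - center, sigma_psf) for x in range(n)]

# Two epochs of the same star with different seeing and noise levels.
epochs = [
    (make_epoch(20.3, 50.0, 1.5), 1.5, 1.0),
    (make_epoch(20.3, 50.0, 2.5), 2.5, 4.0),
]
coadd = likelihood_coadd(epochs)
position = centroid(coadd)
```

No flux-weighted moment is computed anywhere: the position comes purely from the location of the maximum, which is why algorithms written for ordinary coadds do not carry over unchanged.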

There’s nothing deep here - translation is indeed unnecessary if the templates come from forward modeling. It’s only needed if the templates are defined some other way (such as the SDSS symmetry-ansatz templates). Forward-modeling templates are better for a ton of reasons, but it remains to be seen whether we can devise models that are flexible enough to deblend bright galaxies as well.

This document is as much of a writeup of multi-fit as I was planning to do for a while. Coaddition deserves its own in-depth writeup (@jdswinbank has started on this in another branch of LDM-151), as does Detection and Deblending. I’ll return to fleshing out multifit further when those are done.

I hope the boundary between multi-fit and forced-photometry is already a bit more clear from some of my responses above, but I think it’s fair to say that the distinction is some combination of historical artifact (in that it’s the way most astronomers think about the problem) and an iterative rather than simultaneous approach for fitting positions and amplitudes:

  • Multi-Fit is any kind of multi-epoch fitting, which may or may not include per-epoch amplitude parameters;
  • Forced photometry is just fitting per-epoch amplitude parameters while holding everything else fixed, and it’s probably not necessary if multi-fit does include per-epoch amplitude parameters.
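Forced photometry in this sense reduces to a linear fit of one amplitude per epoch against a fixed, unit-flux-scaled model. A minimal sketch follows, assuming uniform pixel noise, a single isolated object, and a Gaussian stand-in for the model fixed by an earlier multi-epoch fit; the names are hypothetical.

```python
import math

def forced_flux(image, model, variance):
    """Least-squares amplitude of a fixed model, plus the amplitude variance,
    assuming the same noise variance in every pixel."""
    norm = sum(m * m for m in model)
    flux = sum(m * d for m, d in zip(model, image)) / norm
    return flux, variance / norm

# Fixed model: position and shape held constant across epochs.
model = [math.exp(-0.5 * ((x - 10) / 1.5) ** 2) for x in range(21)]

# Noise-free toy epochs of a mildly variable source.
true_fluxes = [5.0, 5.5, 4.0]
epochs = [[f * m for m in model] for f in true_fluxes]
light_curve = [forced_flux(image, model, variance=1.0) for image in epochs]
```

Including per-epoch amplitudes in multi-fit would instead solve for these fluxes together with the position and shape parameters, which is where the extra degrees of freedom come from.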

Forced photometry was part of the baseline plan for LSST long before multi-fit was considered, so it’s hard to remove it from the collective nomenclature even if it really is just one of several implementation options for how we measure variability. It’s also, at present, the only option we have implemented for measuring colors or variability.

By the way, any comments you have on how to make the text of the document more clear on these issues would be very welcome, though for that it’s probably best to comment at GitHub.

Hello Jim, all - there has been some discussion on the lsst-milkyway science collaboration mailing list about crowded field photometry and what our needs are likely to be. It sounds like some of the community’s questions will be answered at the 2016 Tucson meeting.

In the meantime, however, I’ve pasted pieces of that discussion below in this message. Hopefully this will help the DM experts get a sense of the discussion in the community.

Cheers

Will

===========

Dear LSST Stars, Milky Way, & Local Volume colleagues,

At the August 2016 LSST Community Workshop there will be a session on Blended Objects (Wed, Aug 18, 1:30-3:00) including presentations by the LSST pipeline (DM) and various science collaborations. I’ve been asked to say a few words about our needs for crowded field photometry. A group of us discussed this at the June 2016 AAS meeting. My notes from that meeting said we had the following messages:

  1. Message to project: Crowded field photometry is important! (And it’s not the same as de-blending a pair of stars or galaxies.)
  2. Questions for project: What are the plans? Is crowded field photometry following “Model A: Project takes it on” or “Model B: Community takes it on”?
  3. Goal: Get access and test the DM crowded field pipeline, through simulations and/or real data.
  4. Goal: Improve communications between project and SMWLV.
  5. Related remark: Current observing strategy tends to get most of the crowded Plane data in first 200 days, very little thereafter.

This is not my field, so I’m asking for help. I know that some of you have extensive experience with DECam and similar data and opinions on what is needed. Anything – comments, big picture messages, science goals, resource estimates, figures, pictures, slides – is welcome. Just reply to this thread so we have it archived for the future.

===================

Many thanks to xxx for rattling the SMWLV cages.

  1. Crowded field astrometry is just as important as crowded field
    photometry, but then again to a hammer everything looks like a nail.

  2. Proper motions in crowded fields sounds like a good idea to pursue.
    Blends with objects that do not move look sort of the same (modulo
    seeing, depth, etc.) each time, but blends with moving objects
    will change over time.

  3. Sure would be good to get parallaxes in crowded fields, even at the
    degraded level mandated by the Cadence with less time given to
    such fields. Is this a bridge too far?

  4. From what little time I have spent playing with DECaLS data, it is
    totally critical to get at least 3 epochs, and not just 2, for trying
    to dig motion out of blends. Were I in charge of the Cadence, I would
    arrange the crowded fields to be visited several times during the
    baseline 10-year survey, and not just in the first few months.

  5. Whatever happened to difference image processing for crowded fields?
    I have not heard anything about DI in recent years, but then again
    I am pretty unplugged these days. Many years ago, DI was promised
    as the panacea for crowded fields.

  6. A huge worry that I have with Gaia is how well it performs in crowded
    fields. With two fields of view on the focal plane at the same time,
    it sounds like yet more of a mess than just one. Perhaps we will get
    the first insight on Sep 14, but then again this first data release
    will be only the “good” objects. But if there are issues or a brighter
    Gaia limiting magnitude, then this is yet another area where LSST
    will be a major step forward.

===========

I’d like to suggest a few specific questions/action items, none of which require a huge time investment:

  1. Can someone volunteer to assess Gaia’s performance in crowded fields using the upcoming September 14th data release? Even something simple, like a spatial map of the flux at which the astrometric error for “good” objects reaches X milliarcsec, would help determine where the bright end of LSST’s discovery space will be in these fields (including e.g. handling bright objects). Since there will likely be lots of eyes on this data when it arrives, we may want to co-ordinate at the Tucson meeting to avoid duplication of effort.

  2. Can we (the community) contribute any data from our experiences on the computing time currently required to generate catalogs from large format crowded fields using industry-standard software?

This will help determine what scale of hardware will be needed to support crowded field science - for example if it would be practical for the community to simply run standard tools on LSST data using institutional hardware available to investigators, and then contribute the point-source catalog back to the LSST project as a level-3-type data product.

My impression is that the above outsourcing-type scenario is the model the LSST project currently has in mind for crowded fields. Currently my concern is that the LSST project and the SMWLV community might each be assuming the other will deliver crowded-field astrometry and photometry!
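As a rough illustration of the Gaia map suggested in the first question above, the sketch below bins a catalog on the sky and, in each cell, finds the brightest magnitude bin whose median astrometric error reaches a chosen threshold. The function name, argument names and binning scheme are all assumptions made for illustration, not taken from any Gaia or LSST tool, and the sketch works in magnitudes rather than flux.

```python
import numpy as np

def limiting_mag_map(ra, dec, mag, astrom_err_mas, threshold_mas,
                     ra_edges, dec_edges, mag_edges, min_per_bin=5):
    """Hypothetical helper: per sky cell, return the center of the brightest
    magnitude bin whose median astrometric error is >= threshold_mas.
    Cells where the threshold is never reached stay NaN."""
    ra_edges, dec_edges, mag_edges = map(np.asarray, (ra_edges, dec_edges, mag_edges))
    n_ra, n_dec = len(ra_edges) - 1, len(dec_edges) - 1
    out = np.full((n_ra, n_dec), np.nan)
    mag_centers = 0.5 * (mag_edges[:-1] + mag_edges[1:])

    # Bin index of every object along each axis (out-of-range objects match no cell).
    i_ra = np.digitize(ra, ra_edges) - 1
    i_dec = np.digitize(dec, dec_edges) - 1
    i_mag = np.digitize(mag, mag_edges) - 1

    for i in range(n_ra):
        for j in range(n_dec):
            in_cell = (i_ra == i) & (i_dec == j)
            # Walk from bright to faint and stop at the first bin over threshold.
            for k, center in enumerate(mag_centers):
                sel = in_cell & (i_mag == k)
                if sel.sum() < min_per_bin:
                    continue  # too few objects for a stable median
                if np.median(astrom_err_mas[sel]) >= threshold_mas:
                    out[i, j] = center
                    break
    return out
```

A cell that crosses the threshold at a brighter magnitude than its neighbors would then flag a region where crowding is degrading the astrometry.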

  3. [Assuming LSST project representatives are on this mailing list:] Can the LSST project provide information at the Tucson meeting about what the plan currently is for crowded field processing by the project? For example, is there a plan to smoothly transition from non-crowded to crowded processing methods? If so, what methods are planned?

  4. I believe a test of the performance of some version of the LSST stack on a number of crowded field datasets is underway, but I confess I have not kept up with this effort (led I believe at NOAO?). It would be good for us to hear about the status of these efforts at the Tucson meeting.

  5. If anyone on this mailing list has been actively pursuing difference imaging in crowded fields on wide-field data (like HSC or DECam), could they please provide a status update on how well it works?

Thanks all - my apologies if any of the above duplicates presentations or sessions you are already planning!

===========

I have annotated my responses. I would also like to add a question to your excellent list:

For assessing systematic errors, a common tool is to add “artificial stars” and see how their positions and mags are recovered. The analysis of LSST simulation images is useful too, to get an initial handle on the effects of crowding, well before any actual data comes out. Is there, or should there be, a plan to generate useful simulations of crowded fields using IMSIM, and to work on them?
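A minimal sketch of what such an artificial-star test might look like is given below. It assumes a circular Gaussian PSF, simple aperture photometry and a global median sky, none of which comes from this thread or from any particular package; the function names are hypothetical, and a real test would use the survey PSF and the actual measurement code.

```python
import numpy as np

def add_star(image, x0, y0, flux, sigma):
    """Return a copy of image with a circular Gaussian 'star' of total flux added."""
    yy, xx = np.indices(image.shape)
    profile = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2))
    return image + flux * profile / (2.0 * np.pi * sigma ** 2)

def measure(image, x0, y0, radius):
    """Aperture flux and flux-weighted centroid after subtracting a global median sky."""
    yy, xx = np.indices(image.shape)
    in_aperture = (xx - x0) ** 2 + (yy - y0) ** 2 <= radius ** 2
    values = image[in_aperture] - np.median(image)
    flux = values.sum()
    x_cen = (xx[in_aperture] * values).sum() / flux
    y_cen = (yy[in_aperture] * values).sum() / flux
    return flux, x_cen, y_cen

def artificial_star_test(image, positions, flux, sigma, radius):
    """Inject one star at a time and return (dx, dy, dmag) for each injection.
    dmag < 0 means the star was recovered too bright. No guard is included
    for a non-positive measured flux."""
    results = []
    for x0, y0 in positions:
        trial = add_star(image, x0, y0, flux, sigma)
        measured, x_cen, y_cen = measure(trial, x0, y0, radius)
        dmag = -2.5 * np.log10(measured / flux)
        results.append((x_cen - x0, y_cen - y0, dmag))
    return results
```

Injecting into an empty image should return offsets close to zero, while injecting next to an existing star of similar brightness pulls the centroid toward the neighbor and makes the recovered magnitude too bright, which is the kind of systematic error the test is meant to expose.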

  1. Can someone volunteer to assess Gaia’s performance in crowded fields using the upcoming September 14th data release? Even something simple, like a spatial map of the flux at which the astrometric error for “good” objects reaches X milliarcsec, would help determine where the bright end of LSST’s discovery space will be in these fields (including e.g. handling bright objects). Since there will likely be lots of eyes on this data when it arrives, we may want to co-ordinate at the Tucson meeting to avoid duplication of effort.

  2. Can we (the community) contribute any data from our experiences on the computing time currently required to generate catalogs from large format crowded fields using industry-standard software?

  I might be able to help with this from experience using DoPHOT. Working on a typical laptop or desktop, with no attempt to parallelize, with crowding at a mean nearest-neighbor distance for r<24 of about 1.5 arcsec (Baade's Window) in 0.8 arcsec seeing, it's about an hour per million objects. That includes the processing that feeds DoPHOT and the calculation of field-dependent aperture corrections. The process can be parallelized in a number of ways, so it should scale well with the number of processors.
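To turn the figure above into a rough hardware estimate, the arithmetic can be sketched as follows. The one hour per million objects is the number quoted above; the perfectly linear scaling with processor count is an optimistic assumption, and the function name and the example catalog size are purely hypothetical.

```python
def catalog_hours(n_objects, n_processors=1, hours_per_million=1.0):
    """Wall-clock hours to build a catalog, assuming ideal linear scaling
    with processor count (an optimistic assumption, not a measurement)."""
    return (n_objects / 1e6) * hours_per_million / n_processors

# Hypothetical example: one billion objects spread over 100 processors.
print(catalog_hours(1e9, n_processors=100))  # 10.0
```

Under these assumptions the cost is dominated by the object count, so the estimate is only as good as the assumed source density and the scaling efficiency actually achieved.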

This will help determine what scale of hardware will be needed to support crowded field science - for example if it would be practical for the community to simply run standard tools on LSST data using institutional hardware available to investigators, and then contribute the point-source catalog back to the LSST project as a level-3-type data product.

 Don't neglect the fact that getting the images to individual investigators is a bottleneck for LSST. I think we will need to vet code beforehand and install it at computing centers/data labs.

My impression is that the above outsourcing-type scenario is the model the LSST project currently has in mind for crowded fields. Currently my concern is that the LSST project and the SMWLV community might each be assuming the other will deliver crowded-field astrometry and photometry!

 Last I heard, crowded field photometry (and astrometry?) is not an LSST deliverable. My information may be outdated. Good to ask at the meeting.

  3. [Assuming LSST project representatives are on this mailing list:] Can the LSST project provide information at the Tucson meeting about what the plan currently is for crowded field processing by the project? For example, is there a plan to smoothly transition from non-crowded to crowded processing methods? If so, what methods are planned?

On a technical aside: I also happen to think that at r~27 (for co-added images) or even r=26, which will be reached by year 2, almost EVERYWHERE in the sky is crowded by galaxies. This means that working definitions of “background” will break down at smaller and smaller spatial scales. For this reason I am an unbeliever in “non-crowded” techniques. I could be convinced otherwise, with a good enough demo. The “industry standard” crowded field methods deal with this, though even here some are better than others at extreme crowding levels.

  4. I believe a test of the performance of some version of the LSST stack on a number of crowded field datasets is underway, but I confess I have not kept up with this effort (led I believe at NOAO?). It would be good for us to hear about the status of these efforts at the Tucson meeting.

  5. If anyone on this mailing list has been actively pursuing difference imaging in crowded fields on wide-field data (like HSC or DECam), could they please provide a status update on how well it works?

I am keen to learn the answer to this question.

========

I don’t know if Sergey Koposov or Vasily Belokurov are members of this group, but one can learn a lot
by reading their paper http://iopscience.iop.org/article/10.1088/0004-637X/805/2/130/pdf.
In my view it shows that to do good photometry you need point-spread-function fitting (I know there
are many exceptions). And that it takes a LOT of computer power.
Sergey did all of DES year 1 in an unbelievably short period of time by using a huge number of CPUs.

I recall another paper, but not the authors, where they used PSF fitting on some cluster captured in SDSS
and did a lot better than what comes out of the pipeline.

============

An et al. (2008) did this for globular clusters, and Smolcic et al. (2007) for the dwarf galaxy Leo I.

=============

Glad to see this conversation happening. A few quick answers for now:

The LSST pipeline already implements PSF fitting for point source photometry - Jeff Carlin led a paper, soon to come out on the arXiv, that uses the point source photometry from an early version of the LSST stack, applied to our group’s Subaru/HSC data, to study a newly discovered local volume dwarf. Handling the background is the trick: the default background algorithm didn’t do well with the partially resolved dwarf, but Paul Price was able to tweak it for us to obtain well calibrated, well measured photometry of the RGB stars.

The UW group is doing difference imaging for their Level 1 work.

===============
