Access patterns from Qserv / shared scan perspective

Two questions:

  1. Will our users typically run analyses across a large subset of the data that involve Object and Source and ForcedSource at the same time? I ask because we have tentatively planned to run the Object+Source shared scan independently from the Object+ForcedSource shared scan: it allows each scan to run at its optimal speed, but it may unnecessarily limit data analysis. If we ran these two scans independently, users would have to save the results from one scan and then join them with the other scan, effectively waiting for two scans instead of one.

  2. How often will users want to join data across data releases, and how easy do we need to make it? We have been tentatively considering serving each DR from a separate cluster / through a separate Qserv, but if users will frequently try to cross-match across different data releases, perhaps we should reconsider.

Do keep in mind that synchronizing everything (Object and Source and ForcedSource and ObjectExtra etc. from multiple data releases) would result in each scan running slower, it would require more resources, and it would complicate the Qserv implementation.
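To make the cost of independent scans concrete, here is a minimal sketch of the two-pass workflow a user would face if Object+Source and Object+ForcedSource run as separate shared scans. The table and column names (`objectId`, `psfFlux`, `mjd`, `flux`) are illustrative stand-ins, not the real Qserv schema, and sqlite3 stands in for Qserv itself.

```python
# Sketch only: hypothetical schema, sqlite3 standing in for Qserv.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Object (objectId INTEGER PRIMARY KEY, ra REAL, decl REAL);
CREATE TABLE Source (sourceId INTEGER PRIMARY KEY, objectId INTEGER, psfFlux REAL);
CREATE TABLE ForcedSource (objectId INTEGER, mjd REAL, flux REAL);
INSERT INTO Object VALUES (1, 10.0, -5.0), (2, 10.1, -5.1);
INSERT INTO Source VALUES (100, 1, 3.2), (101, 2, 1.7);
INSERT INTO ForcedSource VALUES (1, 59000.1, 3.1), (2, 59000.1, 1.8);
""")

# Pass 1: Object+Source shared scan -- the user must save the result
# as their own table before the second scan can use it.
cur.execute("""
CREATE TABLE myResult AS
SELECT o.objectId, s.psfFlux
FROM Object o JOIN Source s USING (objectId)
""")

# Pass 2: join the saved result against the Object+ForcedSource scan.
rows = cur.execute("""
SELECT r.objectId, r.psfFlux, f.flux
FROM myResult r JOIN ForcedSource f USING (objectId)
ORDER BY r.objectId
""").fetchall()
print(rows)
```

If both scans were synchronized, the same result would come from a single three-way join, at the cost of the slower combined scan described above.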


  1. I doubt that people will often want Object, Source, and ForcedSource together. (DiaForcedSources are a different matter.) In fact, if we do things well I don’t think ForcedSource will be used much at all.

  2. I wouldn’t expect much cross-DR work. It’ll be done, but only when people want to know what’s changed between releases. I’d expect people looking for the same object in a new DR to do a spatial match.

I’m a bit worried that this decision apparently changes how you implement Qserv.

I think you mean Source (single-epoch measurements) here, not ForcedSource, which will have the light curves for every object.

These decisions change how we configure the Qserv installations; they shouldn’t change the code for the DBMS itself.

So you’re linking the diffim light curves into Source?

The diffim light curves are DiaSources, no? They get linked with DiaObjects which get matched with Objects. But I think it’s still the case that there are many more ForcedSource-based light curves than DiaSource-based light curves, even with forced DiaSources – so we weren’t planning a shared scan with Object/DiaSource. And using the DiaSources for light curves may be tricky because of template changes.

Hmm. We have a potential problem here (I’d be interested in @mjuric’s take). I’m expecting that we’ll measure only non-variable objects as Objects, via multi-fit (which will allow for proper motion and parallax, so being variable in space is OK until you start worrying about light echoes and solar system objects with accelerations). I thought that we were planning to do forced photometry on the difference images, and that’s where the light curves come from. The use of multiple templates makes this nastier, as @ktl notes.

The SNe Ia folk (e.g. @mwv, @PierreAstier ) like to do their photometry by forward modelling, but I think (although I haven’t demonstrated) that this is equivalent to a particularly good choice of template. This needs to be sorted out.

It’s true that we are also planning to measure Sources and ForcedSources on individual exposures, but I’m unconvinced of the value of these measurements. Because the community fervently believes that we need to make these measurements, we should plan to do so until we’ve demonstrated that the ForcedDiaSource approach is better.

Please cite DPDD language (or lobby @mjuric to change it). I see 5.2 Level 2 Data Processing item 7 and 5.2.4 Forced Photometry defining ForcedSources and saying they are for light curve characterization and 4.2.1 Difference Image Analysis items 9 and 10 defining forced DiaSources (after a DiaObject is created and 30 days before) in Level 1.

Regardless, if we replace ForcedSource by ForcedDiaSource (with just a flux), or, more likely, replace the flux in ForcedSource with one from a difference image, the database access pattern is the same. It’s only if we have a different kind of join that we have any issue.

My understanding is the following:

  • Most Objects will come from deep detection and represent mostly-static Objects, but we should create some from DiaSources as well so we can do forced photometry on them and include them when deblending mostly-static Objects.

  • ForcedSources may or may not be measured on difference images, but they’ll happen at the position of every Object and represent our best measurement of that object’s light curve, so a join with the Object table will be common and simple (no spatial matching required).

  • Sources will contain almost no information that isn’t measured better in ForcedSource, DiaSource, or Object, and I think that will relegate them to being useful only for QA or very rare/exotic things (I can’t actually think of any). That’s assuming we can discount queries by users who don’t know better; that may be difficult (in the sense that many scientists are accustomed to using Source-like catalogs for their science, even if we don’t consider them our best measurements). Even in that case, I expect Source/ForcedSource joins of any kind to be rare.

  • DiaSource/Object/ForcedSource joins may be common, though we could avoid most of the need to include DiaSource in those joins by including aggregates of DiaSource quantities in Object for DiaSource-derived Objects.

  • DiaSource/Object should not require a spatial match in the database, because we’ll resolve those connections in pixel-level processing (or MOPS?) when we create Objects from DiaSources. I had imagined that Source/Object would require a spatial match in the database, because the connections are inherently ambiguous due to blending and users may want to resolve those ambiguities differently.
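The contrast between the two join styles in the bullets above can be sketched as follows. This is a toy illustration with hypothetical table and column names, using sqlite3 and a flat-sky box match in place of a proper cone search: the Object/ForcedSource light-curve join is a plain key join on `objectId`, while Source/Object association needs a spatial predicate because Sources carry no unambiguous object link.

```python
# Sketch only: hypothetical schema, not the real Qserv spatial-match machinery.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Object (objectId INTEGER PRIMARY KEY, ra REAL, decl REAL);
CREATE TABLE ForcedSource (objectId INTEGER, mjd REAL, flux REAL);
CREATE TABLE Source (sourceId INTEGER PRIMARY KEY, ra REAL, decl REAL);
INSERT INTO Object VALUES (1, 10.00, -5.00), (2, 20.00, 5.00);
INSERT INTO ForcedSource VALUES (1, 59000.0, 2.5), (1, 59001.0, 2.6);
INSERT INTO Source VALUES (7, 10.0001, -5.0001);
""")

# Light curve for an Object: simple key join, no spatial matching required.
lc = cur.execute("""
SELECT f.mjd, f.flux
FROM Object o JOIN ForcedSource f USING (objectId)
WHERE o.objectId = 1 ORDER BY f.mjd
""").fetchall()

# Source/Object association: spatial match within a small-angle tolerance.
# (A flat-sky box here; a real system would use a cone search on the sphere.)
tol = 0.001  # degrees, illustrative only
matches = cur.execute("""
SELECT s.sourceId, o.objectId
FROM Source s JOIN Object o
ON ABS(s.ra - o.ra) < ? AND ABS(s.decl - o.decl) < ?
""", (tol, tol)).fetchall()
print(lc, matches)
```

The spatial form is far more expensive at scale, which is why resolving DiaSource/Object links in pixel-level processing, and folding DiaSource aggregates into Object, keeps the common joins in the key-join category.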

My reading of LSE-163 is the same as @ktl’s here. To that end, the current plan as recorded in JIRA (and now PMCS) is that ForcedSources are not measured on difference images. However, there is a stretch goal (DLP-730) of doing so; this is not required to hit the current baseline.

Isn’t this broadly what DiaObject already aims to achieve?

Yes, I just forgot about DiaObject’s existence as something distinct from Object.

The language in the DPDD is:

The master list of Objects in Level 2 will be generated by associating and deblending the list of single-epoch source detections and the lists of sources detected on coadds.

There is currently no mention of filtering the sources, only association and deblending. If we feel that some sort of filtering needs to take place (so that not every 5 sigma peak in any image ever gets an object) then we should specify that in the DPDD. Or if we believe the policy should be to not filter, then we should directly say so and resolve the ambiguity.

One other point: ForcedSources are a requirement for variable stars; I don’t believe you can use difference imaging, since the template will be a mixture of different phases of the source (assuming it’s a coadd). And I agree with @jbosch that Sources will ideally be of little use compared to ForcedSources but are invaluable for QA. Getting back to the original question, the only time I can think of wanting Object+Source+ForcedSource is QA, not science, so it seems acceptable to have extra steps involved in that.

If the idea that we’ll build any Objects by associating Sources from single-epoch detections is still in the DPDD, I think we need to fix that. I was under the impression that was an old idea that had been discredited a while ago within the project. I do think we’ll associate DiaSources to build Objects, and maybe that’s what was meant here, but then we should be more explicit. Can anyone think of a use case for including single-epoch detections that I might have missed? My feeling is that any single-epoch detection will be detected at higher S/N in either a coadd or a difference image.

If we do use difference images to measure ForcedSources, we will indeed need to solve this problem. But that might still be easier than (e.g.) solving the problem of measuring SN light curves in the presence of unknown host galaxy morphology on regular single-epoch images, which is why I think we need to consider both options.

@ctslater correctly quotes LSE-163 §5.1. In §5.2 para 5 this idea is refined somewhat:

My reading of this – which I think we’ve discussed before – is that it leaves the pipeline a lot of latitude: a “consideration” of Sources may result in deciding they add no worthwhile information and can be ignored. For clarity, we may wish to reword §5.1 slightly: DM-4528.