Improving and unifying code for star selection

jbosch · February 16, 2016, 6:37pm

It looks like one of the remaining issues on “Testing DM-4692: the new ProcessCcdTask” is astrometry problems on CFHT, due to some combination of the following:

1. We don’t have a good distortion model for this camera, making it likely we’ll get bad matches unless the reference catalog and our own measurements are sparse.
2. We’re matching a deeper set of detections to the reference catalog on DM-4692 and deblending them, making our own measurements much less sparse.
3. We don’t have enough flexibility in filtering our own measurements before we feed them to the matcher.
4. Our fitter doesn’t do any outlier rejection, making it very susceptible to bad matches.

While we really need to fix at least one of (1) and (4) (@price, @rowen, and @boutigny are working on this now), and (2) should be an improvement in the absence of these other problems, I’d like to focus on (3), which may be the easiest way to get things working for now, but which also raises some code duplication and interface issues.

We have multiple operations that need to select a sample of reasonably bright, unsaturated, possibly isolated stars in single-frame processing: PSF estimation, aperture correction estimation, astrometry, photometric calibration. Eventually, external catalogs may play a role (that’s a separate issue I don’t want to deal with now), but we’ll always need to do some filtering based on our own measurements, rejecting objects due to some flags or other cuts, or possibly selecting objects due to some more complex criteria (“the brightest N objects in each spatial cell”).

I think we want some sort of unified system for specifying these sorts of selections, which we could view as a generalization of the StarSelector interface. The problem is code reuse: do we want to have a selector that checks for condition A, another that checks for condition B, and two more to check for “A and B” as well as “A or B”? Some sort of expression parser would perhaps be the most natural approach for simple filtering, but even if we did want to bite off writing one, some important operations can’t be written out easily in a single line. Passing callable objects (sometimes lambdas, sometimes more) into configuration would be a really slick approach, but I strongly suspect that would make config persistence a nightmare and probably raise some security concerns.
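
To make the expression-parser option concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration, not anything that exists in the stack: the function name `evaluate` and the field names `flux`, `saturated`, and `n_children` are all made up. It accepts a plain string (which persists trivially in a config) and walks the parsed syntax tree against a whitelist, so function calls and attribute access are rejected instead of being handed to `eval`.

```python
import ast
import operator

# Whitelisted comparison operators; anything else (in, is, ...) is rejected.
_COMPARE = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def evaluate(expr, record):
    """Evaluate a restricted boolean expression against one record (a dict).

    Supported syntax: and / or / not, comparisons (including chains such as
    ``0 < flux < 5000``), field names, and numeric or boolean constants.
    """
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not visit(node.operand)
        if isinstance(node, ast.Compare):
            left = visit(node.left)
            for op, comparator in zip(node.ops, node.comparators):
                op_func = _COMPARE.get(type(op))
                if op_func is None:
                    raise ValueError("unsupported operator: " + type(op).__name__)
                right = visit(comparator)
                if not op_func(left, right):
                    return False
                left = right
            return True
        if isinstance(node, ast.Name):
            return record[node.id]  # field lookup; KeyError if unknown
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported syntax: " + type(node).__name__)

    return visit(ast.parse(expr, mode="eval"))


# Hypothetical record; the field names are illustrative only.
record = {"flux": 1200.0, "saturated": False, "n_children": 0}
isolated_bright = evaluate("flux > 1000 and not saturated and n_children == 0", record)
```

This only covers per-record cuts. Selections like “the brightest N objects in each spatial cell” need to see the whole catalog, which is the part that doesn’t fit in a one-line expression.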

Right now, I think just having a plugin system with a few multi-purpose general implementations (e.g. “AND together a list of flags”), along with letting plugins compose themselves via actual class composition, seems to be the approach with the lowest activation energy. Any other ideas?
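
A rough sketch of what the plugin-plus-composition approach might look like, in plain Python over a list of dicts so it runs anywhere. Every name here (`Selector`, `FlagsSelector`, `BrightestSelector`, `AllOf`, `AnyOf`, and the field names) is hypothetical and is not the existing StarSelector interface. Each selector returns one boolean per record, so catalog-level criteria and per-record flag cuts can be combined the same way.

```python
class Selector:
    """Hypothetical base class: returns one boolean per record."""

    def select(self, catalog):
        raise NotImplementedError

    def apply(self, catalog):
        """Return the records for which select() is True."""
        return [rec for rec, keep in zip(catalog, self.select(catalog)) if keep]


class FlagsSelector(Selector):
    """AND together a list of flags: keep records where every flag == expected."""

    def __init__(self, flags, expected=False):
        self.flags = list(flags)
        self.expected = expected

    def select(self, catalog):
        return [all(rec[f] == self.expected for f in self.flags) for rec in catalog]


class BrightestSelector(Selector):
    """Catalog-level criterion: keep the n records with the largest flux."""

    def __init__(self, n, field="flux"):
        self.n = n
        self.field = field

    def select(self, catalog):
        order = sorted(range(len(catalog)),
                       key=lambda i: catalog[i][self.field], reverse=True)
        keep = set(order[:self.n])
        return [i in keep for i in range(len(catalog))]


class AllOf(Selector):
    """Composition: a record passes only if every child selector passes it."""

    def __init__(self, *selectors):
        self.selectors = selectors

    def select(self, catalog):
        masks = [s.select(catalog) for s in self.selectors]
        return [all(column) for column in zip(*masks)]


class AnyOf(AllOf):
    """Composition: a record passes if at least one child selector passes it."""

    def select(self, catalog):
        masks = [s.select(catalog) for s in self.selectors]
        return [any(column) for column in zip(*masks)]


# Illustrative catalog; field names are made up.
catalog = [
    {"id": 1, "flux": 500.0, "saturated": False, "blended": False},
    {"id": 2, "flux": 9000.0, "saturated": True, "blended": False},
    {"id": 3, "flux": 1500.0, "saturated": False, "blended": True},
    {"id": 4, "flux": 2500.0, "saturated": False, "blended": False},
]
clean = FlagsSelector(["saturated", "blended"])
# Each child sees the full catalog, so "brightest 2" is not "brightest 2 clean".
clean_and_bright = AllOf(clean, BrightestSelector(2))
```

Note the design choice in `AllOf`: children are evaluated independently and their masks are combined, which keeps “A and B” and “A or B” symmetric. Chaining selectors so that each one only sees the survivors of the previous one would be a different (also reasonable) composition rule.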

ktl · February 16, 2016, 9:00pm

This of course sounds a lot like SQL. Can we not use that (or at least one of the existing reimplementations) in some way?

jbosch · February 16, 2016, 9:37pm

It certainly would be nice to be able to write SQL (or maybe just WHERE clauses) for the simple filters with lots of composition it supports. But we’d need a way to parse it in a way that lets us filter in-memory tables based on NumPy (afw.table for now, perhaps astropy.table or Pandas in the future?). Or we’d need to have a clever table class with a SQL backend and a NumPy-friendly interface. There may be existing implementations for some part of that - it certainly seems like something a lot of astronomers would use, if it existed - but I’m not aware of any.
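
For what it’s worth, the “SQL backend” half of this can be sketched with nothing but the standard-library `sqlite3` module. This is only an illustration under loud assumptions: the helper name `select_where` and the field names are invented, the records are plain dicts rather than afw.table or NumPy rows, and the whole catalog is copied into an in-memory database on every call, which is exactly the overhead a NumPy-friendly implementation would want to avoid.

```python
import sqlite3


def select_where(records, where):
    """Return the records (dicts sharing the same keys) matching a WHERE clause.

    The clause is interpolated into the query verbatim, so this is only
    appropriate for trusted configuration, not arbitrary user input.
    """
    if not records:
        return []
    columns = list(records[0])
    conn = sqlite3.connect(":memory:")
    try:
        column_list = ", ".join('"%s"' % c for c in columns)
        conn.execute("CREATE TABLE src (%s)" % column_list)
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            "INSERT INTO src VALUES (%s)" % placeholders,
            [tuple(rec[c] for c in columns) for rec in records],
        )
        # In a freshly created table rowid runs 1..N in insertion order,
        # so it maps straight back to positions in the input list.
        cursor = conn.execute(
            "SELECT rowid FROM src WHERE %s ORDER BY rowid" % where
        )
        return [records[rowid - 1] for (rowid,) in cursor]
    finally:
        conn.close()


# Illustrative catalog; booleans are stored by SQLite as 0/1.
catalog = [
    {"id": 1, "flux": 500.0, "saturated": False, "blended": False},
    {"id": 2, "flux": 9000.0, "saturated": True, "blended": False},
    {"id": 3, "flux": 1500.0, "saturated": False, "blended": True},
    {"id": 4, "flux": 2500.0, "saturated": False, "blended": False},
]
stars = select_where(catalog, "flux > 1000 AND NOT saturated AND NOT blended")
```

The string passed as `where` is the kind of thing that could live in a config field as-is.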

ktl · February 16, 2016, 9:40pm

By “reimplementations”, I was thinking foremost of Pandas, which appears to have much of SQL’s capabilities.

jbosch · February 16, 2016, 9:42pm

If Pandas can do SQL (or something similarly featureful in describing expressions), I’m all for it. My knowledge of Pandas is sorely lacking.

timj · February 16, 2016, 9:49pm

http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

jbosch · February 16, 2016, 10:09pm

Oh. I guess I’m not really impressed by that, actually. Relative to SQL, that requires the name of the variable holding the table object to appear in the expression when you convert it to a string, and there’s a lot of repetition of that name, as well as a lot of extra punctuation. So it’s not nearly as nice as just putting a SQL phrase into a config field for either the user or the implementer. I imagine we could still use it with some string preprocessing and some Python eval calls, but it doesn’t look qualitatively different from afw.table, astropy.table, or even vanilla structured NumPy arrays in terms of supporting WHERE clauses. At most, it seems like NumPy boolean indexing with the bitwise operators overloaded so you don’t have to write e.g. numpy.logical_and out by hand (and they may not have even done that - I can’t tell from the examples).

Pandas support for more complex SQL-like operations (JOIN, GROUP BY) does look very nice, but I don’t think it’s something we really need in this particular context (they certainly seem useful for other things, especially in QA and interactive algorithm debugging).

ktl · February 17, 2016, 2:48pm

While the syntax may not be as compact as possible, using one popular package seems to me far better than using something custom or a variety of different non-interoperable packages (e.g. NumPy and an in-memory SQL database) in different places.

On the other hand, if configurability at this level is required, then the tradeoffs start to move in a different direction (but still against custom code).

jbosch · February 17, 2016, 6:04pm

I’m certainly not suggesting we write a competitor to Pandas ourselves, just pointing out that Pandas isn’t actually any better at this than any other table library. It is indeed more popular and undoubtedly better (than at least afw.table) in myriad other ways.

I do think putting the SQL phrase into a config field is essentially the ideal - it moves as much logic as we’d want from code to config, but no more - but it’s not a requirement.