Open questions after generating Gaia DR2 refcat

The process of generating the Gaia DR2 reference catalogs has left me with some open questions about how we produce our refcats and what goes into them. The DR2 refcat I made should be usable for astrometric (but not photometric) fitting. However, there are ways it could be made more useful, and there are components of our refcat system that are currently unused.

  1. Our catalog normalization process (“ingest”) requires certain things to be true about the catalog we’re ingesting from, which may not always be true. For example, we get our fluxes by converting from a magnitude column, and we assume that a magnitude error field is also provided. Gaia does not provide magnitude errors, because “…the error distribution is only symmetric in flux space.” Astronomers’ reliance on passing around magnitudes strikes again!

  2. We have thing_flag columns in our refcats but we mostly do not use them and have not specified what they mean other than “don’t trust thing”. Thus, I didn’t implement parallax_flag or flux_flag for Gaia, because I didn’t know how to choose what “trustworthiness” meant. Converting each external catalog’s idiosyncratic definitions of “good/bad” fields would require some specialized sub-classing of our Ingester, which didn’t seem worth it in this case. And the work I’ve done to parallelize the ingestion process may have made such future subclassing more difficult.

  3. The external catalogs may provide other fields that are useful: we have a way to just send them straight over, but that doesn’t mean anyone can use them (if foo is only in the Gaia refcat, you can’t write code that assumes it will be in every refcat). Sometimes those fields do contain information we need, but in a way that again requires a specialized subclass. For example, Gaia has a field for whether a source is variable, but it’s a trinary string field, and thus not trivial to convert into our boolean “is_variable” flag.
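
To make the trinary-to-boolean problem in item 3 concrete, here is a minimal sketch of one possible conversion. The three string values are what I understand the Gaia DR2 variability column to contain; treat them as an assumption and check the Gaia data model before relying on them. The function name and the extra "is_known" output are hypothetical and not part of our refcat schema: they only illustrate that a single boolean loses the "not available" state unless it is carried separately.

```python
# Assumed values of the three-valued Gaia variability column (verify against
# the Gaia data model; they are not taken from our ingest code).
VARIABLE = "VARIABLE"
CONSTANT = "CONSTANT"
NOT_AVAILABLE = "NOT_AVAILABLE"


def convert_variable_flag(values):
    """Convert three-valued strings into two parallel lists of booleans.

    Returns (is_variable, is_known). `is_known` is a hypothetical companion
    flag recording whether the upstream catalog made any statement at all.
    """
    is_variable = []
    is_known = []
    for value in values:
        if value not in (VARIABLE, CONSTANT, NOT_AVAILABLE):
            raise ValueError(f"Unexpected variability value: {value!r}")
        is_variable.append(value == VARIABLE)
        is_known.append(value != NOT_AVAILABLE)
    return is_variable, is_known


is_variable, is_known = convert_variable_flag(
    ["VARIABLE", "CONSTANT", "NOT_AVAILABLE"]
)
print(is_variable)
print(is_known)
```

Whether "not available" should map to False, or to a separate flag as sketched here, is exactly the kind of definition we have not yet written down.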

In generating the Gaia DR2 refcat, I decided to take the expedient route and just get something done that we could use. I think we’re probably going to have to wrestle with these questions before too long, and I wanted to at least get them written down so they are not forgotten.

I almost feel like we should rope a few people (Jim, Yusra, Paul at least?) together for an hour or so at the LSST August meeting to hash out an approach?

I don’t understand the comment about magnitude errors. That is, Gaia provides flux errors, so why can’t we use them to calculate the magnitude errors in a way that’s useful?
Is the problem something about how we did the export from Gaia?

I’ve been avoiding bringing this up because I don’t have a lot of time to devote to following up on it, but I’ve had the sinking feeling for a while that we should really just be ingesting reference catalogs more or less “as is” and instead write custom loader tasks/classes for each one, rather than trying to normalize them on ingest. We’d probably still have to shard them and perhaps filter out columns in ingest, I suppose, but by putting so much normalization in the ingest step we make it a lot more likely that we’ll have to re-ingest a lot.

They provide instrumental fluxes and errors (in electrons/second), not calibrated fluxes and errors. Search for “phot_g_mean_flux” on this page:

This approach still requires some specialization code, and we still have to do the sharding, so I’m not sure it matters that much from a “quantity of human work” perspective. Practically, if we’re already having to read things in order to shard them, we might as well normalize them at that time, to make reading them more trivial.

I don’t understand what you mean by “re-ingest a lot”?

Right now, any time we want to change something about what “normalized” means - what columns are present, what they mean (e.g. changing flux units, adding proper motions, rethinking uncertainty representations, defining better flags) - in a way that would affect how we would extract them from the native catalog, we have to re-ingest or at least patch the files.

Re-ingesting is serious work in its own right, but because we need those formats to be backwards compatible (at least for some deprecation period), normalizing at ingest doesn’t actually save the loader code from having to be able to read multiple formats. And then we need to worry about when it’s okay to remove obsoleted versions of old reference catalogs, considering that they may have been used to create processed datasets we might want to reproduce.

Contrast that with just ingesting native catalogs more-or-less as is: the ingested files never have to get touched, and the loaders (while we have a different one for each upstream catalog etc.) never have to support different on-disk versions of those. Updates to what the matching and calibration code expects are just regular code changes with no special backwards compatibility needs. For that to work you really do have to include all columns you might ever need when sharding them, of course, but if we did this with a column store format like Parquet, even adding new columns we didn’t anticipate needing originally is an easy additive-only patch with no backwards compatibility issues.
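
A rough sketch of what "one loader per upstream catalog" could look like, to make the proposal above easier to discuss. Everything here is illustrative: the loader functions, the registry, the standardized column names and the choice of radians are hypothetical, and the native column names are only meant to resemble Gaia-like and PS1-like tables. The point is that the on-disk shard keeps its native columns, and the mapping to our schema lives in code that can change without touching the files.

```python
import math

# Milliarcseconds to radians (1 degree = 3.6e6 mas).
MAS_TO_RAD = math.radians(1.0 / 3.6e6)


def load_gaia_like(native):
    """Standardize a hypothetical Gaia-like shard at read time."""
    return {
        "coord_ra": [math.radians(v) for v in native["ra"]],
        "coord_dec": [math.radians(v) for v in native["dec"]],
        "pm_ra": [v * MAS_TO_RAD for v in native["pmra"]],
    }


def load_ps1_like(native):
    """Standardize a hypothetical PS1-like shard with no proper motions."""
    return {
        "coord_ra": [math.radians(v) for v in native["raMean"]],
        "coord_dec": [math.radians(v) for v in native["decMean"]],
        "pm_ra": [math.nan for _ in native["raMean"]],
    }


# Hypothetical registry: one loader per upstream catalog.
LOADERS = {"gaia_like": load_gaia_like, "ps1_like": load_ps1_like}


def load_refcat(name, native):
    """Dispatch to the loader for the named upstream catalog."""
    return LOADERS[name](native)


# The native shard is never rewritten; only the loader code evolves.
shard = {"ra": [180.0], "dec": [0.0], "pmra": [3.6e6]}
refcat = load_refcat("gaia_like", shard)
print(refcat)
```

Changing what "standardized" means (say, adding a column or switching units) would then be an edit to the relevant loader rather than a re-ingest, at the cost of doing the conversion on every read.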

I probably misunderstood, then. The Gaia_dr2 catalog on lsst-dev at /datasets/hsc/repo/ref_cats/gaia_DR2 (linked to /scratch/parejkoj/gaia/refcat/ref_cats/gaia-dr2) doesn’t seem to have magnitude errors (i.e. phot_g_mean_mag_fluxErr is NaN), but as it isn’t officially announced I was making assumptions about what you’re doing. To clarify, are you calculating errors as part of your new ingestion from e.g. phot_g_mean_flux_over_error?
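
For reference, the standard first-order propagation from a flux signal-to-noise ratio to a magnitude uncertainty is sigma_m = (2.5 / ln 10) / (f / sigma_f), which only needs a column like phot_g_mean_flux_over_error and no zero point. The sketch below is not what the ingest code does; the function names are hypothetical. The second function shows the asymmetry behind Gaia's "only symmetric in flux space" remark: a plus or minus one sigma change in flux gives different bright-side and faint-side magnitude offsets, and the difference matters at low signal-to-noise.

```python
import math


def mag_err_from_snr(flux_over_error):
    """First-order (symmetric) magnitude uncertainty from flux S/N."""
    if not flux_over_error > 0:
        return math.nan  # also catches NaN input
    return 2.5 / math.log(10) / flux_over_error


def asymmetric_mag_err(flux_over_error):
    """Magnitude offsets for +1 sigma and -1 sigma changes in flux.

    Returns (brighter, fainter) as positive numbers; NaN if S/N <= 1,
    where the faint-side offset is undefined.
    """
    if not flux_over_error > 1:
        return math.nan, math.nan
    rel = 1.0 / flux_over_error
    brighter = 2.5 * math.log10(1.0 + rel)
    fainter = -2.5 * math.log10(1.0 - rel)
    return brighter, fainter


for snr in (100.0, 5.0):
    print(snr, mag_err_from_snr(snr), asymmetric_mag_err(snr))
```

At high signal-to-noise the three numbers agree closely, so the symmetric approximation is probably adequate for bright reference stars; at low signal-to-noise they diverge.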

Given everything I’ve said above, I have not computed flux errors for our Gaia DR2 refcat. The catalogs you’ve linked will be RFCed this week once DM-20756 is reviewed; unless that review points out any errors, I am not planning any “new ingestion”.

@RHL: I’ve replied to your question about flux errors on the relevant ticket.

@jbosch: that’s an interesting idea. I’m somewhat warming up to it as I ponder it. I think it comes down to the balance between 1) how often we think our refcat format will change, 2) how many different places we expect to get external catalogs from, and 3) how computationally intensive it is to turn a generic refcat bucket into our own format on the fly.

I hope that 1) is not very often, now that we’ve got proper motion and parallax terms defined in our catalogs, and we have a versioning system so you at least know what you’ve read. If the format is mostly stable from now on, we shouldn’t have to worry about re-ingesting very much.

For 2) the balance of work I think depends on whether most catalogs are “close enough” to our existing ingestion system (favoring using a slightly modified version of that), or whether most are “different enough” (favoring using an “everything in the bucket” approach). The only refcat examples we have right now are PS1 and Gaia, and I don’t have a PS1 DR2 catalog to make a judgement on it in comparison with Gaia.

I fear that 3) may be more painful than we want. I don’t believe that we have a way of doing vectorized conversions into our SpherePoint and Angle units (which is why e.g. _setProperMotion() looks the way it does), so we’d either have to write C++ code for the specialized loaders, or deal with some non-trivial row-level operations at the Python level. Standardizing everything on ingest means that the reading code can be trivial and thus very fast.

Is any of this affected by gen3 plans for dealing with reference catalogs? I don’t know if anything about the refcats is going to change in gen3. Since gen3 leans more on databases, would it be worth dropping the sharded files approach and dumping the refcats into a big database instead (and thus having a reason to redo the readers anyway)?

I think the questions you had that kicked off this thread are an indication that the format has not yet stabilized.

We need that anyway, and I’ve got a lot of it already implemented on an RFC-blocked ticket.

It would be feasible to put the reference catalogs in a database in Gen3, but it’s not something I’d want to do before Gen2 is retired, and it’s not obviously easier/better than continuing to work with sharded files. I do think Gen3 overall encourages a sharper split between the spatial indexing, filtering, and concatenating aspect of loading vs. the column mapping aspect of loading; it would be quite natural for all of the spatial stuff to eventually happen behind butler.get, and while we could do column remapping/standardization there too, that suffers from many of the same problems of doing raw standardization behind butler.get. I’d eventually (i.e. after Gen2 retirement) move to a model where both raw standardization and refcat standardization happens on top of butler.get in Task code; whether we also provide syntactic sugar to make it look like those things happen behind butler in some contexts is an open question @rhl and I have discussed at length and haven’t yet converged on (and may not until after Gen2 is retired).

and I’ve replied on that ticket! The summary is that I think we still need to add magnitude errors to these GDR2 astrometric catalogues, and also BP/RP photometry.

I disagree with this point: I think that our existing “maximal” refcat schema provides everything we might need for LSST. The trick is getting external data into that format, and more clearly defining what some of the fields mean (e.g. the various flags).

Letting the refcat schema change over time because the reading code changes doesn’t help the problem that the code that uses the refcat still has to know how to work with the various versions. For example, changing behavior if field X is not in the refcat doesn’t care whether the field isn’t there because the on-disk format changed, or because we’ve modified the reader to compute that field from the “raw” on-disk data.

Ok, so maybe it wasn’t accurate to say that the format hasn’t stabilized. But my larger point that the on-disk files that we have right now are unstable still holds, because we haven’t stabilized the mapping from upstream catalogs to our format.

This is an intrinsically hard problem, and I’m not saying storing the refcats on disk with their upstream schemas simply solves it. I’m saying it lets us approach the problem incrementally and continuously, because we don’t have to re-ingest the upstream catalogs every time our understanding of the upstream catalog and how it relates to the data we want to process improves.