We’ve had a few discussions on Slack about how to make exposure or visit (or similar) metadata values that are produced by processing (e.g. seeing derived from the PSF model, not DIMM data) available to the Gen3 butler query system, and `QuantumGraph` generation in particular. A few different ideas for doing this have been brought up, and I wanted to get them in one place and expound a bit on what I think the best approach looks like now.
First, I should make it clear that there are probably cases where using registry queries to do selections on (say) seeing isn’t actually what you want: a `PipelineTask` that cares about a very specific definition of some metadata quantity may actually want to filter its inputs in its own `run` method. And in other cases, the registry query may just be one of several ways to generate an explicit list of data IDs that could be saved in some other form, and possibly manually vetted, before they are actually used in QG generation. Let’s not get distracted by those possibilities here.
The original scheme for this kind of metadata was developed in the butler working group days. It involves having dedicated metadata tables for each `StorageClass`, populated by the `Formatter` when a dataset is `put`: the `Formatter` looks at the in-memory Python object it’s given and extracts some values from it to give back to the `Registry` before it actually writes the file. So in the case of seeing, we’d have a `seeing` column (allowed to be `NULL`) in some table that would have a row for each `afw.image.Exposure` dataset we write, and the `Formatter` classes that deal with that class would all know that they have to do `object.getPsf().computeMoments().getDeterminantRadius()` or something similar to compute that `seeing` value to give back to the registry.
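To make that extraction step concrete, here’s a minimal sketch of what the `Formatter`-side logic could look like. The `FakeExposure`/`FakePsf` stubs and the `extract_seeing` helper are hypothetical stand-ins, not real afw or butler API; the point is just the shape of the put-time hook.

```python
class FakeQuadrupole:
    """Stand-in for an afw-style moments object."""

    def __init__(self, radius):
        self._radius = radius

    def getDeterminantRadius(self):
        return self._radius


class FakePsf:
    """Stand-in for a PSF model attached to an exposure."""

    def computeMoments(self):
        return FakeQuadrupole(0.8)


class FakeExposure:
    """Stand-in for afw.image.Exposure."""

    def getPsf(self):
        return FakePsf()


def extract_seeing(exposure):
    """What a Formatter might hand back to the Registry at put-time.

    Returns None (which would map to a NULL column value) when the
    exposure has no PSF model.
    """
    psf = exposure.getPsf()
    if psf is None:
        return None
    return psf.computeMoments().getDeterminantRadius()


print(extract_seeing(FakeExposure()))  # 0.8
```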
This is all still viable, though it’s never been a high priority, and it’s a lot of fairly fundamental butler API and schema changes (the kind that would require broad but probably not very deep changes elsewhere). And it may still be a good approach for saving simpler things, like the number of pixels in an image or the number of rows in a catalog. But I don’t like the inflexibility of things being set only at `put`-time and then locked there forever, or the idea that `Formatter`s should know quite that much about the content of what they’re saving. We’d also have to add support for looking at those metadata tables to the query system itself, but that’s the case for any approach to this problem.
I think the new metrics-gathering system that @KSK, @bechtol, and @jeffcarlin are building on top of `PipelineTask` provides a path to an alternative. Seeing could just be a trivial `PipelineTask` metric they compute, or even something they gather from task-level metadata (I’m not sure how tied their system is to the `MetricTask` system that @kfindeisen et al. put together in Gen2 with Gen3 in mind, but I’m sure we can make them interoperate if they don’t already). And if we can make the query system hook into those metric values, we’ll give it access to much more than we’d have ever considered saving at `put` time in the previous approach. Many of the pieces are already there, or will be arriving soon:
- `Registry` has a system for creating “opaque” tables in the same database (and schema) that holds the registry itself. We already use this in most `Datastore`s, to hold information that is private to that `Datastore`.
- Our query expression parser currently only permits identifiers that are part of our predefined “dimensions” system, but it wouldn’t be at all difficult to allow it to handle opaque tables as well.
- We can already define a chain of `Datastore`s that delegates saving particular dataset types or storage classes to a special `Datastore`.
So, instead of saving our metric measurements to tiny JSON files, we’d write a simple, single-purpose custom `Datastore` that saves metric measurements to opaque table rows in the `Registry` database. We might only need one table for all metrics: the columns would just be the integer `dataset_id` and the actual metric measurement data, so the question is how many columns we need for the metric measurement data, and how stable that is. The `dataset_id` would be sufficient to tie it back to the actual name of the metric (via the dataset type), the processing run, and the data ID.
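A rough sketch of what that single-purpose `Datastore`’s `put` could look like, using stubs in place of the real butler classes. `MetricsDatastore`, `FakeRegistry.insertOpaqueData`, and the `metric_values` table name are all hypothetical (the real `Datastore` ABC has many more methods); this only illustrates the one interesting operation.

```python
from types import SimpleNamespace


class FakeRegistry:
    """Stand-in for the Registry's opaque-table interface."""

    def __init__(self):
        self.rows = []

    def insertOpaqueData(self, table, **row):
        # In reality this would be an INSERT into the opaque table.
        self.rows.append((table, row))


class MetricsDatastore:
    """Saves metric measurements as opaque table rows instead of files."""

    def __init__(self, registry):
        self.registry = registry

    def put(self, inMemoryDataset, ref):
        # ref.id is the integer dataset_id; the metric name (via the
        # dataset type), run, and data ID are all recoverable from it
        # on the Registry side, so we store nothing else.
        self.registry.insertOpaqueData(
            "metric_values",
            dataset_id=ref.id,
            value=inMemoryDataset.value,
        )


registry = FakeRegistry()
store = MetricsDatastore(registry)
measurement = SimpleNamespace(value=0.65)  # stand-in for a metric measurement
store.put(measurement, SimpleNamespace(id=1))
print(registry.rows)  # [('metric_values', {'dataset_id': 1, 'value': 0.65})]
```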
That’s not quite enough; we’d also need to make it so the opaque table registration system could be told that a `dataset_id` column will be present in the table. The query system would need that information to know how to connect any expression on opaque table columns back to the rest of the query (i.e. how to join the opaque table to other tables). That’s not hard at all, compared to the custom `Datastore`, and it might even be something we’d want to consider in the internal-`Datastore` opaque tables, if we wanted to provide query system access to anything in those (by definition making them a bit less private-to-`Datastore`, of course).
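To make the join requirement concrete, here’s a toy SQLite sketch of how a query on an opaque metrics table could be tied back to datasets via `dataset_id`. The table and column names here are illustrative, not the real Registry schema, and the `dataset` table stands in for the full set of dataset/dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (
        dataset_id   INTEGER PRIMARY KEY,
        dataset_type TEXT,  -- encodes the metric name
        run          TEXT,
        data_id      TEXT   -- stand-in for the real dimension columns
    );
    CREATE TABLE metric_values (  -- the opaque table
        dataset_id INTEGER REFERENCES dataset (dataset_id),
        value      REAL
    );
    """
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?)",
    [(1, "seeing", "run1", "visit=42"), (2, "seeing", "run1", "visit=43")],
)
conn.executemany(
    "INSERT INTO metric_values VALUES (?, ?)",
    [(1, 0.65), (2, 1.1)],
)

# The query system would generate a join like this from a user
# expression along the lines of `seeing < 0.8`:
rows = conn.execute(
    """
    SELECT d.data_id
    FROM dataset d
    JOIN metric_values m ON m.dataset_id = d.dataset_id
    WHERE d.dataset_type = 'seeing' AND m.value < 0.8
    """
).fetchall()
print(rows)  # [('visit=42',)]
```

Knowing that `dataset_id` exists in the opaque table is exactly what lets the query system emit the `JOIN ... ON` clause above; without that hint, the table is opaque in the full sense and can’t participate in queries at all.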
Overall, this seems much easier to implement (even with a custom `Datastore`) than the original approach we had in mind, and much more powerful. It’s also layered much more nicely, because it’s not something `Butler` itself would natively support, but rather something we’d be building on top of `Butler` using features it already (mostly) provides. And with the way we define those opaque tables, it’s not quite a schema change: the Butler is happy to add them to a repo after that repo has already been created, and it has no problem reading repos that don’t have them, because they’re not intrinsic to the repo. The metrics-based approach is also not exclusive; we could do the original approach too, but I’m hoping that the metrics-based one will work well enough that we don’t have to.