We’ve had a few discussions on Slack about how to make exposure or visit (or similar) metadata values that are produced by processing (e.g. seeing derived from the PSF model, not DIMM data) available to the Gen3 butler query system, and to `QuantumGraph` generation in particular. A few different ideas for doing this have been brought up, and I wanted to get them in one place and expound a bit on what I think the best approach looks like now.
First, I should make it clear that there are probably cases where using registry queries to do selections on (say) seeing isn’t actually what you want - a `PipelineTask` that cares about a very specific definition of some metadata quantity may actually want to filter its inputs in its own `run` method. And in other cases, the registry query may just be one of several ways to generate an explicit list of data IDs that could be saved in some other form, and possibly manually vetted before they are actually used in QG generation. Let’s not get distracted by those possibilities here.
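(Just to pin down what that first alternative means, here is a minimal, purely hypothetical sketch - the task, its `maxSeeing` config field, and the choice of seeing definition are all invented for illustration:)

```python
from lsst.pipe.base import PipelineTask


class StrictSeeingCoaddTask(PipelineTask):
    """Hypothetical task that applies its own seeing cut inside run()."""

    def run(self, exposures):
        # Apply this task's own, very specific definition of "seeing",
        # rather than whatever definition a registry query would encode.
        good = [
            exp for exp in exposures
            if exp.getPsf().computeShape().getDeterminantRadius()
            < self.config.maxSeeing
        ]
        ...  # proceed using only the inputs that pass the cut
```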
The original scheme for this kind of metadata was developed in the butler working group days, and it involves having dedicated metadata tables for each `StorageClass`, and having those populated by the `Formatter` when a dataset is `put`: the `Formatter` would look at the in-memory Python object it’s given and extract some values from it to give back to the `Registry` before it actually writes the file. So in the case of seeing, we’d have a `seeing` column (allowed to be `NULL`) in some table that would have a row for each `afw.image.Exposure` dataset we write, and the `Formatter` classes that deal with that class would all know that they have to do `object.getPsf().computeShape().getDeterminantRadius()` or something similar to compute that `seeing` value to give back to the registry.
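For concreteness, here is roughly what that would have looked like; `extractMetadata` is a hypothetical hook that was never actually added to the `Formatter` API:

```python
from lsst.daf.butler import Formatter


class ExposureFormatter(Formatter):
    """Sketch of a Formatter under the original (unimplemented) scheme."""

    def extractMetadata(self, exposure):
        # Hypothetical hook: called at put-time, before the file is
        # written, with the in-memory object; the returned dict would
        # populate the per-StorageClass metadata table.
        psf = exposure.getPsf()
        seeing = None  # the column is NULL-able, e.g. for PSF-less exposures
        if psf is not None:
            seeing = psf.computeShape().getDeterminantRadius()
        return {"seeing": seeing}
```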
This is all still viable, though it’s never been a high priority, and it’s a lot of fairly fundamental butler API and schema changes (the kind that would require broad but probably not very deep changes elsewhere). And it may still be a good approach for saving simpler things, like the number of pixels in an image or the number of rows in a catalog. But I don’t like the inflexibility of values being set only at `put`-time and then locked there forever, or the idea that `Formatters` should know quite that much about the content of what they’re saving. We’d also have to add support for looking at those metadata tables to the query system itself, but that’s the case for any approach to this problem.
I think the new metrics-gathering system that @KSK, @bechtol, and @jeffcarlin are building on top of `PipelineTask` provides a path to an alternative. Seeing could just be a trivial `PipelineTask` metric they compute, or even something they gather from task-level metadata (I’m not sure how tightly their system is tied to the `MetricTask` system that @kfindeisen et al. put together in Gen2 with Gen3 in mind, but I’m sure we can make them interoperate if they don’t already). And if we can make the query system hook into those metric values, we’ll give it access to much more than we’d ever have considered saving at `put` time in the previous approach. Many of the pieces are already there, or will be arriving soon:
- `Registry` has a system for creating “opaque” tables in the same database (and schema) that holds the registry itself. We already use this in most `Datastores`, to hold information that is private to that `Datastore`.
- Our query expression parser currently only permits identifiers that are part of our predefined “dimensions” system, but it wouldn’t be at all difficult to allow it to handle opaque tables as well (see the sketch after this list).
- We can already define a chain of `Datastores` that delegates saving particular dataset types or storage classes to a special `Datastore`.
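Putting that second point in user-facing terms, a query expression might eventually be able to reference an opaque table’s columns directly. This is purely illustrative - `metric_values` and its `value` column are names I’m making up for the hypothetical shared metrics table discussed below:

```python
# Illustrative only: "metric_values" and its "value" column are made-up
# names, and the collection name is just an example.
dataIds = registry.queryDataIds(
    ["visit", "detector"],
    datasets="calexp",
    collections="HSC/runs/RC2",
    where="instrument = 'HSC' AND metric_values.value < 0.7",
)
```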
So, instead of saving our metric measurements to tiny JSON files, we’d write a simple, single-purpose custom `Datastore` that saves metric measurements to opaque table rows in the Registry database instead. We might only need one table for all metrics - the columns would just be the integer `dataset_id` and the actual metric measurement data - so the question is how many columns we need for the metric measurement data, and how stable that is. The `dataset_id` would be sufficient to tie it back to the actual name of the metric (via the dataset type), the processing run, and the data ID.
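As a sketch of what that single shared table might look like, in plain SQLAlchemy (which is what the registry is built on; in practice this would presumably go through the opaque-table registration API instead), with a column set that is only a guess:

```python
import sqlalchemy

metadata = sqlalchemy.MetaData()

# One possible shape for a single shared metrics table: dataset_id ties
# each row back to its dataset type (and hence the metric's name), its
# run, and its data ID, so only the measurement needs dedicated columns.
metric_values = sqlalchemy.Table(
    "metric_values",
    metadata,
    sqlalchemy.Column("dataset_id", sqlalchemy.BigInteger, primary_key=True),
    sqlalchemy.Column("value", sqlalchemy.Float, nullable=True),
    sqlalchemy.Column("unit", sqlalchemy.String(16), nullable=True),
)
```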
That’s not quite enough; we’d also need to make it so the opaque-table registration system could be told that a `dataset_id` column will be present in the table - the query system would need that information to know how to connect any expression on opaque table columns back to the rest of the query (i.e. how to join the opaque table to other tables). That’s not hard at all, compared to the custom `Datastore`, and it might even be something we’d want to consider in the internal-`Datastore` opaque tables, if we wanted to provide query system access to anything in those (by definition making them a bit less private-to-`Datastore`, of course).
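Continuing the SQLAlchemy sketch above, this is roughly the join the query system could generate once it knows the opaque table carries a `dataset_id` column; the `dataset` table here is a heavily simplified stand-in for the registry’s real one:

```python
# Simplified stand-in for the registry's dataset table.
dataset = sqlalchemy.Table(
    "dataset",
    metadata,
    sqlalchemy.Column("dataset_id", sqlalchemy.BigInteger, primary_key=True),
    sqlalchemy.Column("dataset_type_name", sqlalchemy.String(64)),
)

# The dataset_id column declared at registration time is what tells the
# query system how to connect an expression on metric_values columns
# back to the rest of the query.
query = (
    sqlalchemy.select(dataset.c.dataset_id)
    .select_from(
        dataset.join(
            metric_values,
            dataset.c.dataset_id == metric_values.c.dataset_id,
        )
    )
    .where(metric_values.c.value < 0.7)
)
```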
Overall, this seems much easier to implement (even with a custom `Datastore`) than the original approach we had in mind, and much more powerful. It’s also layered much more nicely, because it’s not something `Butler` itself would natively support, but rather something we’d be building on top of `Butler` using features it already (mostly) provides. And it’s not quite a schema change, given the way we define those - these are all tables that the `Butler` is happy to add to a repo after that repo has already been created, and it has no problem reading repos that don’t have those tables, because they’re not intrinsic to the repo. The metrics-based approach is also not exclusive; we could do the original approach, too, but I’m hoping that the metrics-based approach will work well enough that we don’t have to.