Processing-output metadata and metrics in Gen3 queries and QuantumGraph generation

We’ve had a few discussions on Slack about how to make exposure or visit (or similar) metadata values that are produced by processing (e.g. seeing derived from the PSF model, not DIMM data) available to the Gen3 butler query system and QuantumGraph generation in particular. A few different ideas for doing this have been brought up, and I wanted to get them in one place and expound a bit on what I think the best approach looks like now.

First, I should make it clear that there are probably cases where using registry queries to do selections on (say) seeing isn’t actually what you want - a PipelineTask that cares about a very specific definition of some metadata quantity may actually want to filter its inputs in its own run method. And in other cases, the registry query may just be one of several ways to generate an explicit list of data IDs that could be saved in some other form, and possibly manually vetted before they are actually used in QG generation. Let’s not get distracted by those possibilities here.

The original scheme for this kind of metadata was developed in the butler working group days. It involves dedicated metadata tables for each StorageClass, populated by the Formatter when a dataset is put: the Formatter inspects the in-memory Python object it’s given and extracts a few values to hand back to the Registry before it actually writes the file. So in the case of seeing, we’d have a seeing column (allowed to be NULL) in some table with a row for each afw.image.Exposure dataset we write, and the Formatter classes that deal with that class would all know that they have to call object.getPsf().computeMoments().getDeterminantRadius() or something similar to compute the seeing value to give back to the registry.
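For concreteness, here is a hypothetical sketch of that put-time extraction. The extract_metadata hook does not exist in the real Formatter API, and the exact afw calls are illustrative:

```python
# Hypothetical sketch only: there is no extract_metadata hook on Formatter
# today, and the afw calls used to compute "seeing" are illustrative.
class ExposureFormatter:

    def extract_metadata(self, exposure):
        """Return registry-side metadata extracted from the in-memory object
        just before the file is actually written.
        """
        seeing = None
        psf = exposure.getPsf()
        if psf is not None:
            # Determinant radius of the PSF second moments, in pixels.
            seeing = psf.computeShape().getDeterminantRadius()
        return {"seeing": seeing}
```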

This is all still viable, though it’s never been a high priority, and it would involve a lot of fairly fundamental butler API and schema changes (the kind that would require broad but probably not very deep changes elsewhere). And it may still be a good approach for saving simpler things, like the number of pixels in an image or the number of rows in a catalog. But I don’t like the inflexibility of values being set only at put-time and then locked there forever, or the idea that Formatters should know quite that much about the content of what they’re saving. We’d also have to teach the query system itself to look at those metadata tables, but that’s true of any approach to this problem.

I think the new metrics-gathering system that @KSK, @bechtol, and @jeffcarlin are building on top of PipelineTask provides a path to an alternative. Seeing could just be a metric computed by a trivial PipelineTask, or even something they gather from task-level metadata (I’m not sure how closely their system is tied to the MetricTask system that @kfindeisen et al. put together in Gen2 with Gen3 in mind, but I’m sure we can make them interoperate if they don’t already). And if we can make the query system hook into those metric values, we’ll give it access to much more than we’d ever have considered saving at put time in the previous approach. Many of the pieces are already there, or will be arriving soon:

  • Registry has a system for creating “opaque” tables in the same database (and schema) that holds the registry itself. We already use this in most Datastores, to hold information that is private to that Datastore.
  • Our query expression parser currently only permits identifiers that are part of our predefined “dimensions” system, but it wouldn’t be at all difficult to allow it to handle opaque tables as well.
  • We can already define a chain of Datastores that delegates saving particular dataset types or storage classes to a special Datastore.

So, instead of saving our metric measurements to tiny JSON files, we’d write a simple, single-purpose custom Datastore that saves them as opaque table rows in the Registry database. We might only need one table for all metrics - the columns would just be the integer dataset_id and the actual metric measurement data - so the question is how many columns we need for the metric measurement data, and how stable that set is. The dataset_id would be sufficient to tie it back to the actual name of the metric (via the dataset type), the processing run, and the data ID.
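A minimal sketch of what that could look like, assuming the existing Registry opaque-table calls (registerOpaqueTable / insertOpaqueData) and a single float value column; the table name, the Datastore class, and the way it gets a registry handle are all invented for illustration:

```python
import sqlalchemy
from lsst.daf.butler import ddl

METRICS_TABLE = "metric_measurements"  # invented table name

# One row per metric dataset: the dataset_id ties the value back to the metric
# name (via the dataset type), the processing run, and the data ID.
# Note: nothing here tells the query system that dataset_id is the column to
# join on; that gap is discussed below.
metrics_table_spec = ddl.TableSpec(
    fields=[
        ddl.FieldSpec("dataset_id", dtype=sqlalchemy.BigInteger, primaryKey=True),
        ddl.FieldSpec("value", dtype=sqlalchemy.Float, nullable=True),
    ]
)


class MetricsDatastore:
    """Single-purpose Datastore sketch that stores lsst.verify Measurement
    datasets as opaque-table rows instead of tiny JSON files.

    Real Datastores are constructed from config and reach the opaque-table
    machinery through a bridge manager; that wiring is hand-waved here as a
    bare ``registry`` handle.
    """

    def __init__(self, registry):
        self.registry = registry
        self.registry.registerOpaqueTable(METRICS_TABLE, metrics_table_spec)

    def put(self, measurement, ref):
        # Metric values are floating-point scalars, so one value column suffices.
        self.registry.insertOpaqueData(
            METRICS_TABLE,
            {"dataset_id": ref.id, "value": float(measurement.quantity.value)},
        )
```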

That’s not quite enough; we’d also need to make it so the opaque table registration system could be told that a dataset_id column will be present in the table - the query system would need that information to know how to connect any expression on opaque table columns back to the rest of the query (i.e. how to join the opaque table to other tables). That’s not hard at all, compared to the custom Datastore, and it might even be something we’d want to consider for the internal-Datastore opaque tables, if we wanted to provide query system access to anything in those (by definition making them a bit less private-to-Datastore, of course).
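To make the goal concrete, the end state might look something like this. The seeing_metric.value identifier and the join behind it are the new, hypothetical parts; constraining on dimension columns like instrument already works, and queryDataIds is the current name of the data-ID query entry point:

```python
# Hypothetical: once the expression parser accepts opaque-table identifiers,
# QuantumGraph generation (or an ad-hoc query like this one) could constrain
# on metric values directly.  "seeing_metric" is an invented metric dataset
# type name.
data_ids = registry.queryDataIds(
    ["visit", "detector"],
    where="instrument = 'HSC' AND seeing_metric.value < 1.2",
)
```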

Overall, this seems much easier to implement (even with a custom Datastore) than the original approach we had in mind, and much more powerful. It’s also layered much more nicely, because it’s not something Butler itself would natively support, but rather something we’d be building on top of Butler using features it already (mostly) provides. And it’s not quite a schema change, with the way we define those - these are all tables that the Butler is happy to add to a repo after that repo has already been created, and it has no problem reading repos that don’t have those tables, because they’re not intrinsic to the repo. The metrics-based approach is also not exclusive; we could do the original approach, too, but I’m hoping that the metrics-based approach will work well enough that we don’t have to.

Are you thinking of a (dataset_id, metric_a, metric_b, metric_c) table or a (dataset_id, metric_name, metric_value) table?

In the latter case, you may need multiple columns (and additional join complexity) if the metric values cannot all be stored in a single SQL data type, but adding new metrics is trivial. Query performance may be degraded, however, depending on the kinds of queries actually used. And we need to consider that we may have billions of datasets in a 2033 repo.
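For concreteness, the two layouts would hold rows shaped roughly like this (column names invented for illustration):

```python
# "Wide" layout: one column per metric.  A single join suffices, but adding a
# new metric means altering the table.
wide_row = {"dataset_id": 12345, "metric_a": 0.73, "metric_b": 1.2, "metric_c": None}

# "Tall" name/value layout: one row per (dataset, metric).  Adding a metric is
# just new rows, but metrics whose values don't fit a single SQL type would
# need extra value columns, and constraining on several metrics at once means
# joining the table against itself.
tall_rows = [
    {"dataset_id": 12345, "metric_name": "metric_a", "metric_value": 0.73},
    {"dataset_id": 12345, "metric_name": "metric_b", "metric_value": 1.2},
]
```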

Doesn’t this basically force you to instrument your pipelines with lots of metrics if you want to make use of selection functionality assumed by your own tasks? While MetricTasks certainly can be prerequisites for other tasks, using them as structural elements in a science pipeline seems like a misuse.

you may need multiple columns (and additional join complexity) if the metric values cannot all be stored in a single SQL data type

All metric values are floating-point scalars. It’s an intrinsic limitation of the lsst.verify system.

First, we are using MetricTask as the base class for all our metric measurement tasks, so we are completely tied in with that ecosystem.

I’m not sure that we would need MetricTask to be a structural element if the butler knows how to put Measurement objects to an opaque table.

It won’t be necessary - or even useful - to instrument the pipeline one is running with MetricTasks, because we need the metric measurements in hand before QuantumGraph generation in this scheme. Instead, the metric measurements would have to come from previous processing runs that were instrumented with MetricTasks; those runs may have nothing in common with the current input collections other than data IDs (though they could also be the same as the input collections, or a subset of them, etc.).

Sorry, can you give a concrete example? I’m not sure how your “previous processing run” is different from what I described.