Query interface changes in Gen3 Butler

jbosch · August 7, 2020, 7:34pm

I’ve just merged DM-25919, which both renames and modifies some of the Gen3 butler query methods that many of you are already familiar with.

First, the bad news (breaking changes):

Registry.queryDimensions has been renamed to queryDataIds.
The expand argument has been removed from queryDimensions/queryDataIds and queryDatasets.
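
For anyone updating call sites, here is a minimal sketch of what the migration might look like. It assumes `registry` is a Gen3 `Registry` instance you already have in hand; the helper name `expanded_visit_data_ids`, the choice of the `visit` dimension, and the `where` string you pass in are illustrative only. The `expanded` and `toSet` methods it uses are the ones described further down in this post.

```python
def expanded_visit_data_ids(registry, where):
    """Return expanded data IDs for the visits matching ``where``.

    Hypothetical helper, for illustration only. Before DM-25919 the same
    request would have been spelled roughly as:

        registry.queryDimensions(["visit"], where=where, expand=True)
    """
    # The method has a new name and no longer accepts expand=...
    results = registry.queryDataIds(["visit"], where=where)
    # Expansion is now requested explicitly by chaining .expanded() onto the
    # result object; .toSet() then fetches everything into a Python set.
    return results.expanded().toSet()
```

If you never needed the expanded records in the first place, dropping the `.expanded()` call should give you the fast path.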

And now the good news:

queryDataIds and queryDatasets are now faster by many orders of magnitude for large queries, at least by default (performance is similar to the old speed with expand=False, but expand=True was the default).
There is a new method, queryDimensionRecords, which returns metadata rows for a dimension directly, and is hence a much more convenient interface for that purpose (compared to the old approach of querying for data IDs, and then accessing .records on those).
queryDataIds and queryDatasets now return custom iterator objects (DataCoordinateQueryResults and DatasetQueryResults) with many extra methods, most of which return new result objects (it’s a “method chaining” interface, for those of you familiar with that concept). Those include an expanded method that replaces the old expand=True keyword argument (but without the enormous performance penalty), a findDatasets method to do bulk searches for datasets whose data IDs were identified by the original query, and a materialize context manager that stores the results in a temporary table in the database, allowing follow-up related queries without having to nest (and hence possibly re-execute) the original query as a subquery or round-trip the results through Python objects. These result objects are all still lazy iterators that don’t execute the query until iteration begins; we don’t want to assume users always want to fetch all results and stuff them in a container, even if that’s often the case. They do have toSet and toSequence methods that make fetching into Python containers easy when desired.
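
To make the new pieces a bit more concrete, here is a rough sketch of how they might fit together. It is not taken from the package docs: the helper names, the `exposure`/`detector` dimensions, the `raw` dataset type, and the `collections` and `where` arguments are all assumptions, and the exact keyword arguments each method accepts should be checked against the API docs once they are built.

```python
def visit_records(registry, where):
    """Fetch metadata rows for the visit dimension directly (illustrative)."""
    # This replaces the old approach of querying for data IDs and then
    # accessing .records on each of them.
    return list(registry.queryDimensionRecords("visit", where=where))


def find_raws(registry, collections, where):
    """Find 'raw' datasets for the exposure+detector data IDs matching ``where``.

    Hypothetical helper; dimension and dataset type names are assumptions.
    """
    # Building the result object does not run the query: it is a lazy iterator.
    data_ids = registry.queryDataIds(["exposure", "detector"], where=where)
    # materialize() stores the rows in a temporary table for the duration of
    # the with-block, so follow-up queries do not re-execute the original one.
    with data_ids.materialize() as stored:
        # toSequence() fetches the data IDs into a Python sequence.
        matching_ids = stored.toSequence()
        # findDatasets() is lazy as well, so it is iterated inside the
        # with-block, while the temporary table still exists.
        refs = list(stored.findDatasets("raw", collections=collections))
    return matching_ids, refs
```

Whether `materialize` is worth it depends on how many follow-up queries you run; for a single lookup, calling `findDatasets` directly on the original result object is probably simpler.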

As documented on DM-24938, these changes make the parts of QuantumGraph generation that they were intended to optimize dramatically faster, but they make what is actually the bottleneck slightly slower, so there’s little overall change in performance. But they also set the stage for optimizing that bottleneck in the same way (on DM-24432, my current project), so I’m optimistic that we’ll soon get QuantumGraph generation from approximately an hour (per tract, on HSC) down to 10-15 minutes.

I’ll add API doc links to the text above once the weekly docs are built. User guide docs for this functionality are not yet written; there’s some more functionality I’d like to add first.