One of the tasks on my radar for Gen3 Butler development is a refactor of how we specialize various operations for different databases in the Registry hierarchy. I already know how I want to do a lot of that, but one very open general question is how best to handle bulk operations across the Python/SQL boundary.
As an example, we have a Python API (selectDimensions) that runs a big query and returns an iterator over data structures that contain (among other things) references to datasets. We have another Python API (associate) that takes a sequence of references to datasets and adds them to a collection. That means I could combine these with Python code like this:
    def extractDatasets(iter):
        for row in iter:
            for dataset in row.datasetRefs.values():
                yield dataset

    result = registry.selectDimensions(...)
    registry.associate("MyCollection", extractDatasets(result))
That all looks quite natural in Python, and the fact that we’re using Python iterators rather than collections means it should scale fine in terms of memory usage even for very large queries.
Under the hood, that actually involves iterating over the results, constructing a bunch of Python objects to represent them (many of which are never used in this example), and then iterating and transforming them again in various ways, just to run one INSERT statement for each dataset. (I also think I know how to get that done in one or a few INSERTs instead by batching rows up in the last step; let’s not worry about that distinction for now).
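The batching mentioned above could look roughly like the executemany-style pattern below. This is only a sketch under some assumptions: a SQLAlchemy connection, a made-up two-column collection table (not the actual Registry schema), and a `ref.id` attribute standing in for whatever the real dataset key is.

```python
def associate_batched(connection, collection_table, collection, refs, batch_size=1000):
    # Sketch only: collection_table is assumed to be a sqlalchemy.Table with
    # 'collection' and 'dataset_id' columns; ref.id is assumed to be the
    # dataset's primary key. Rows are accumulated and handed to SQLAlchemy in
    # a few executemany-style calls instead of one INSERT per dataset.
    batch = []
    for ref in refs:
        batch.append({"collection": collection, "dataset_id": ref.id})
        if len(batch) >= batch_size:
            connection.execute(collection_table.insert(), batch)
            batch = []
    if batch:
        connection.execute(collection_table.insert(), batch)
```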
It strikes me that it would potentially be much more efficient if we could do this lazily and keep all of the actual manipulations and iteration in the database, culminating in one INSERT ... SELECT operation. I can roughly imagine what it would take to do that:
1. Have the selectDimensions method return a special iterator that just remembers the query and doesn’t actually execute it until you try to iterate over it.
2. Move the logic in the extractDatasets transformation function into a method on that special iterator class, along with any other transformations we think might be useful, so that those methods can in turn return similarly special iterators based on modified queries.
3. Have associate check whether what it’s given is a special iterator or a regular Python iterable, and give it a special implementation for the former case that does an INSERT ... SELECT. (A rough sketch of all three pieces follows below.)
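To make that concrete, here is a rough sketch of what the deferred iterator and the INSERT ... SELECT branch of associate might look like, assuming a SQLAlchemy (1.4-style) connection and a simplified schema with a single dataset_id column. DeferredQueryResults, extractDatasetIds, and collection_table are all invented for illustration; none of them are existing Butler names.

```python
import sqlalchemy


class DeferredQueryResults:
    """An iterator-like object that remembers a SELECT but does not run it."""

    def __init__(self, connection, select):
        self._connection = connection
        self._select = select  # a sqlalchemy Select

    def __iter__(self):
        # The query is only executed when someone actually iterates.
        return iter(self._connection.execute(self._select))

    def extractDatasetIds(self):
        # The analogue of extractDatasets: instead of looping in Python,
        # return a new deferred object wrapping a narrowed-down query.
        return DeferredQueryResults(
            self._connection,
            self._select.with_only_columns(
                self._select.selected_columns["dataset_id"]
            ),
        )


def associate(connection, collection_table, collection, datasets):
    if isinstance(datasets, DeferredQueryResults):
        # Everything stays in the database: one INSERT ... SELECT.
        # (Reaching into the private attribute for brevity; a real
        # implementation would expose the underlying query properly.)
        connection.execute(
            collection_table.insert().from_select(
                ["collection", "dataset_id"],
                sqlalchemy.select(
                    sqlalchemy.literal(collection).label("collection"),
                    datasets._select.selected_columns["dataset_id"],
                ),
            )
        )
    else:
        # Ordinary Python iterable: fall back to a row-by-row (or batched) INSERT.
        connection.execute(
            collection_table.insert(),
            [{"collection": collection, "dataset_id": ref.id} for ref in datasets],
        )
```

With something like that in place, the original two lines of user code could stay essentially unchanged; whether the work happens row by row in Python or in a single INSERT ... SELECT would depend only on what selectDimensions hands back.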
That’s a lot of extra complexity in the implementation, but I could imagine the speedup being important, both when we have genuinely large-scale queries in Oracle registries and when time spent holding file locks needs to be minimized at medium scale in SQLite. I’m not at the point where I’d want to implement this kind of lazy evaluation on a bunch of registry methods now, but I am strongly considering trying now to make our APIs consistent with adding that kind of implementation in the future.
So, for those of you with more database development experience than me, a few questions:
1. Is there prior art for designing these kinds of APIs that I should read up on?
2. The above case is a simple one, in that associate only needs to insert into one table. Some of our bulk operations need to insert into or delete from multiple tables, and for those to work we’d need to either repeat the passed-through SELECT or use temporary tables (both options are sketched below). I assume which is best would require context-dependent tests, but any thoughts on what to do by default?
3. Any other questions I should be asking about this kind of thing?
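On the second question, the two options might look roughly like this when driven as raw SQL from Python. All of the table and column names here are made up for illustration, and the temporary-table syntax shown is the SQLite/PostgreSQL form rather than Oracle’s global temporary tables.

```python
import sqlalchemy


def insert_into_two_tables_by_repeating(connection, collection, run_id):
    # Option 1: repeat the SELECT once per target table.
    subquery = "SELECT dataset_id FROM dataset WHERE run_id = :run_id"
    connection.execute(
        sqlalchemy.text(
            "INSERT INTO dataset_collection (collection, dataset_id) "
            f"SELECT :collection, dataset_id FROM ({subquery}) AS q"
        ),
        {"collection": collection, "run_id": run_id},
    )
    connection.execute(
        sqlalchemy.text(
            f"INSERT INTO dataset_tags (dataset_id) SELECT dataset_id FROM ({subquery}) AS q"
        ),
        {"run_id": run_id},
    )


def insert_into_two_tables_via_temp_table(connection, collection, run_id):
    # Option 2: materialize the SELECT once into a temporary table, then
    # insert from that into each target table.
    connection.execute(
        sqlalchemy.text(
            "CREATE TEMPORARY TABLE tmp_selected AS "
            "SELECT dataset_id FROM dataset WHERE run_id = :run_id"
        ),
        {"run_id": run_id},
    )
    connection.execute(
        sqlalchemy.text(
            "INSERT INTO dataset_collection (collection, dataset_id) "
            "SELECT :collection, dataset_id FROM tmp_selected"
        ),
        {"collection": collection},
    )
    connection.execute(
        sqlalchemy.text("INSERT INTO dataset_tags (dataset_id) SELECT dataset_id FROM tmp_selected")
    )
    connection.execute(sqlalchemy.text("DROP TABLE tmp_selected"))
```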
To avoid going all ORM/lazy-evaluation, you might try something a lot simpler: create a specialized selectDimensionsDatasets method that only gets dataset handles/ids from the database. Is that sufficiently reusable to be worthwhile?
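Roughly, usage might look like this; the name selectDimensionsDatasets and the plain-id return value are just guesses at what would be simplest, not an existing Registry API:

```python
# Hypothetical: fetch only dataset ids, not full result rows or DatasetRefs.
dataset_ids = registry.selectDimensionsDatasets(...)
registry.associate("MyCollection", dataset_ids)
```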
I think there are variants on that approach that would definitely be reusable. My concern would be that I don’t have a lot of intuition for where that approach would land in terms of performance. It seems at least moderately likely that just doing any per-dataset Python iteration or client-server back-and-forth in these bulk operations will come with enough overhead that optimizing what happens in those loops or transfers doesn’t matter much.
Of course, from that perspective, maybe even just trying to make the interfaces potentially scalable this way is premature optimization right now.
I was assuming that there might be up to hundreds of rows to be inserted, in which case having a single column versus an entire row would make a difference.
If there are thousands (or millions?) of records that have to be inserted within one transaction on commonly-used tables, then you may be right to be concerned about having data even leave the database server. But in that case, I’d argue that your scalability is inherently problematic and that, rather than trying to optimize the transactions, you might be better off looking for a way to restructure to a transaction-free or minimal-transaction model.
The idea of delayed evaluation is reasonable, of course, and things like that can help, but I have not seen an approach like that implemented in database APIs. One alternative approach could be to redesign the API so that you don’t need those intermediate steps between select and insert, i.e. instead of (or in addition to) the select-then-associate code above, a single call that does both.
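For illustration, the combined form might be something like a single call that takes the same query arguments selectDimensions would; associateFromQuery is just an invented name here, not an existing Registry method:

```python
# Hypothetical combined API: the SELECT and the INSERT both happen inside the
# database in one call, with no intermediate Python iteration.
registry.associateFromQuery("MyCollection", ...)
```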
In terms of order-of-magnitude sizing, the (lone) data point I can easily get at right now is that ci_hsc comes to about 3k records in the dataset table. That’s a pretty small graph, so I’d guess it’s on the small side of what we’d process on a typical single worker node in one job in a production. But I don’t actually have a clear requirement that we do any operations on that scale in a single transaction. I was actually more concerned that doing bulk operations in many smaller transactions would be slower, since it very much is slower in SQLite (because each transaction is a lock acquire/release).
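To illustrate that SQLite concern with plain sqlite3 (made-up table, made-up row count): each commit acquires and releases the write lock and syncs to disk, so many small transactions pay that cost per row while one enclosing transaction pays it once.

```python
import sqlite3

conn = sqlite3.connect("registry-demo.sqlite3")
with conn:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset_collection (collection TEXT, dataset_id INTEGER)"
    )

rows = [("MyCollection", i) for i in range(3000)]

# Many small transactions: one lock acquire/release (and fsync) per insert.
for row in rows:
    with conn:  # each block commits, i.e. one transaction per row
        conn.execute(
            "INSERT INTO dataset_collection (collection, dataset_id) VALUES (?, ?)", row
        )

# One enclosing transaction: a single lock acquire/release for all 3000 rows.
with conn:
    conn.executemany(
        "INSERT INTO dataset_collection (collection, dataset_id) VALUES (?, ?)", rows
    )
```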
Trying to reduce transactions is certainly worth thinking about. I think we probably do want pretty rigorous atomic operations in Butler usage in general, and I think that’s a lot easier if we try to do that via the transaction support in the databases we’re using rather than layering something of our own on top. But the contexts where that’s hard from a performance standpoint are probably also contexts where there’s an extra level of middleware controlling what’s going on - the case we’ve run into so far is of course lots of PipelineTasks running concurrently writing to the same Butler - and we may be able to rely on that middleware level to guarantee consistency more efficiently and thus minimize our use of DB transactions when that’s the case. FWIW, that’s not even a case that involves bulk operations.
And that brings me back to the point that I’m probably overthinking this at this stage, because I don’t actually know that there will be a problem with slow bulk operations. It’s good to know that I’m not missing a lot of prior art or design patterns in this area, and while I may keep an eye out for (and slightly prefer) APIs that could be implemented with lazy evaluation in the future, I won’t put a ton of effort into that. We can, after all, always add optimized bulk interfaces for operations we currently only have row-by-row APIs for, without removing the current APIs.