Deleting datasets and provenance in the Gen3 Butler

This is a very technical post intended primarily for Gen3 middleware insiders. If you have an opinion but lack the background for all of it to make sense, please ask for clarification here or in #dm-middleware on Slack.

I’d like to get some feedback on how to handle in-database provenance (quanta) when deleting datasets. I’m talking about full deletion here - it’s safer and easier by far to just delete datastore artifacts (e.g. files) and/or disassociate from tagged collections, and none of the below applies to those cases, which don’t affect provenance at all.

As a bit of background, the DAG of dataset provenance mirrors the processing DAG that produced the datasets via PipelineTasks; there are three relevant tables (sketched just after this list):

  • the dataset table has its own dataset_id primary key, and a quantum_id foreign key to the quantum table; this is the quantum used to produce the dataset (it also has other fields that aren’t relevant for this discussion);

  • the quantum table has its own quantum_id primary key (and other fields not relevant for this discussion);

  • the dataset_consumers table has both quantum_id and dataset_id foreign keys, and describes the datasets that were used as inputs to that quantum. Those fields together form a compound primary key. (There is also a bool field, actual, that can be used to record whether the input was merely predicted or actually used, but it’s not relevant for this discussion.)
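Roughly, and purely as a sketch of these relationships (this is not the real schema code; only the columns mentioned above are shown, with types simplified):

```sql
-- Simplified, illustrative DDL for the relationships described above
-- (other columns omitted; not the actual Gen3 schema definitions).
CREATE TABLE quantum (
    quantum_id  INTEGER PRIMARY KEY AUTOINCREMENT
    -- ... other fields not relevant for this discussion
);

CREATE TABLE dataset (
    dataset_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    quantum_id  INTEGER,  -- the quantum that produced this dataset
    -- ... other fields not relevant for this discussion
    FOREIGN KEY (quantum_id) REFERENCES quantum (quantum_id) ON DELETE SET NULL
);

CREATE TABLE dataset_consumers (
    quantum_id  INTEGER NOT NULL,
    dataset_id  INTEGER NOT NULL,
    actual      BOOLEAN NOT NULL,  -- predicted-only vs. actually-used input
    PRIMARY KEY (quantum_id, dataset_id),
    FOREIGN KEY (quantum_id) REFERENCES quantum (quantum_id),
    -- the ON DELETE behavior of this one is exactly what's under discussion:
    FOREIGN KEY (dataset_id) REFERENCES dataset (dataset_id)
);
```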

When we fully delete a dataset from the Registry, we’ve by definition destroyed at least some of the provenance of any datasets that were produced by pipelines that used the deleted dataset as an input. I had up to now assumed that the right way to handle this was to aggressively delete any quanta that used the dataset as an input, because it was better for them to be gone than to be incomplete. This can’t be done fully by ON DELETE CASCADE, which is slightly annoying but not a fundamental problem, because the relationship between a dataset and the quanta that use it is mediated by dataset_consumers. Deleting the quantum just sets dataset.quantum_id to NULL (via ON DELETE SET NULL) for datasets produced by that quantum, which is nice and clean (and easy): further-downstream quanta aren’t destroyed, but we’ve made it clear that we no longer have any information about where those datasets came from.
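Concretely, the aggressive approach I’d assumed looks something like this against the sketch above (illustrative SQL only; the real logic would live in Registry code, and :deleted_id is just a placeholder parameter):

```sql
-- Sketch of "aggressively delete consuming quanta" for a dataset :deleted_id.

-- 1. Delete every quantum that consumed the dataset; this can't be a simple
--    ON DELETE CASCADE because the link goes through dataset_consumers.
DELETE FROM quantum
    WHERE quantum_id IN (
        SELECT quantum_id FROM dataset_consumers WHERE dataset_id = :deleted_id
    );

-- 2. Datasets produced by those quanta keep their rows, but their
--    dataset.quantum_id is nulled out via ON DELETE SET NULL.

-- 3. Finally delete the dataset itself.
DELETE FROM dataset WHERE dataset_id = :deleted_id;
```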

I think we have a few other options now, and I wanted to see if others have any preferences:

A. We could remove the foreign key constraint on dataset_consumers.dataset_id, and allow it to contain a value that isn’t in the dataset table at all (because it’s been deleted). The rest of the quantum would stay, unless it was deleted explicitly. This saves us some logic, and it preserves a bit more of the provenance. It does mean code that uses quanta needs to be prepared for the possibility of a dangling reference to a dataset, and to recognize that as an indication of incomplete provenance. Previously this would have been problematic because SQLite can sometimes recycle autoincrement IDs that had been used by now-deleted rows (leading to incorrect links, not just broken ones) - but we already need to tell it not to do that (which we can and will shortly do) to avoid the same problem in dataset_location_trash.
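With (A), code that reads provenance would need to notice dangling references itself, along these lines (a sketch only):

```sql
-- Sketch of detecting dangling inputs under (A): any consumer row whose
-- dataset_id no longer resolves marks the quantum's provenance as incomplete.
SELECT dc.quantum_id, dc.dataset_id
    FROM dataset_consumers AS dc
    LEFT JOIN dataset AS d ON d.dataset_id = dc.dataset_id
    WHERE dc.quantum_id = :quantum_id
      AND d.dataset_id IS NULL;
```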

B. We could allow dataset_consumers.dataset_id to be NULL, and define its foreign key constraint with ON DELETE SET NULL. This is a lot like (A) in behavior, but NULL is a much nicer way to represent “there was a dataset here that’s been deleted” than a dangling ID value (and it maps nicely to None in Python). Making the field nullable would mean we’d have to remove it from the compound primary key, and either not have one or define a surrogate autoincrement key.
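Schematically, (B) would turn dataset_consumers into something like the following (surrogate-key variant shown; the surrogate column name is made up, and this is a sketch, not a proposed definition):

```sql
-- Sketch of dataset_consumers under (B): dataset_id is nullable with
-- ON DELETE SET NULL, so it drops out of the compound primary key and a
-- surrogate key takes over.
CREATE TABLE dataset_consumers (
    dataset_consumers_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    quantum_id  INTEGER NOT NULL,
    dataset_id  INTEGER,  -- NULL means "there was an input here that's been deleted"
    actual      BOOLEAN NOT NULL,
    FOREIGN KEY (quantum_id) REFERENCES quantum (quantum_id),
    FOREIGN KEY (dataset_id) REFERENCES dataset (dataset_id) ON DELETE SET NULL
);
```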

C. We could make the foreign key on dataset_consumers.dataset_id ON DELETE CASCADE (so the whole row disappears when the dataset is deleted), but record that the quantum is incomplete in some other way. I’m thinking vaguely of storing a hash/checksum/count of the quantum’s complete set of inputs in the quantum row itself, which could then be compared to the set of inputs according to the current (post-deletion) state of dataset_consumers to see if the quantum is still complete. That requires code using the provenance to make that check, but I think we can surface it as a flag on the Quantum Python class itself and in the Registry code that retrieves quanta (which doesn’t yet exist).
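For (C), the check is then just a comparison between what was recorded at execution time and what’s left in dataset_consumers; roughly (input_count here is a hypothetical column, and a count is the simplest of the hash/checksum/count options):

```sql
-- Sketch of the completeness check under (C): input_count is a hypothetical
-- quantum column recording how many inputs the quantum originally had; if
-- consumer rows have been cascaded away, the counts no longer match.
SELECT q.quantum_id,
       q.input_count != COUNT(dc.dataset_id) AS provenance_incomplete
    FROM quantum AS q
    LEFT JOIN dataset_consumers AS dc ON dc.quantum_id = q.quantum_id
    WHERE q.quantum_id = :quantum_id
    GROUP BY q.quantum_id, q.input_count;
```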

I don’t see any advantages of A over B. C seems like a lot of overhead.

I don’t understand why we would allow a dataset to be deleted when it’s part of the provenance of another dataset. What are the use cases where something like that would be useful?

For the most part, this kind of full deletion is something you’d do when you know you don’t care about that provenance, probably because the datasets were essentially development scratch work - it may never happen in the DBB or other official repos that are limited to very formalized processing (I’ll leave that up to others like @MichelleGower to decide whether they want that or not).

It may also be important if you’re planning to remove the downstream dataset(s) whose provenance you’re breaking as well, but just haven’t yet: if we don’t allow any deletion that removes provenance, then the order in which you can remove datasets is very restricted, and it might be hard or expensive to determine a viable order.

From the DBB viewpoint, I expect most of the cases will be deletions from DataStore(s) and disassociation from collections. I am having trouble coming up with a use case where we’d want to delete datasets from the Registry that were used as inputs to quanta that are parents of other still-existing datasets in the DBB.

For test/junk runs that we absolutely never need any information about, it probably isn’t worth deleting them if there are only a small number. But if we do need to purge them, I would say that’s a “purge run” function that wouldn’t break the provenance of still-existing datasets: if deleting a run would break provenance for other datasets, “purge run” shouldn’t let us do it.
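In other words, the purge would need a guard check along these lines first (just to illustrate the idea; the run column and parameter are placeholders for however runs are actually identified):

```sql
-- Illustrative check for a "purge run": refuse if any dataset in the run was
-- consumed by a quantum that produced a dataset outside the run.
-- (:run_id and dataset.run_id are placeholders, not real schema names.)
SELECT COUNT(*) AS external_consumers
    FROM dataset_consumers AS dc
    JOIN dataset AS input  ON input.dataset_id   = dc.dataset_id
    JOIN dataset AS output ON output.quantum_id  = dc.quantum_id
    WHERE input.run_id = :run_id
      AND output.run_id != :run_id;
```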

Deleting from the registry also has implications for the multi-user registry. For example, users’ registries probably have dataset id dependencies on DBB datasets. It feels to me like, for now, we can get by with the DataStore deletions and disassociation from collections, and revisit Registry dataset deletion when we get to the multi-user registry.

We will definitely want this kind of full deletion for internal development repos (SQLite ones, and the future per-user space in multi-user repos), so I need to do something now, but I’m totally on board with the idea that whatever I do in that area may never be relevant for DBB or other “official” repos, and if it does become relevant we can work it out later.

In fact, I’m pretty sure we could support multi-user registries in which the base layer doesn’t allow provenance-destroying deletion and per-user layers do, just by defining different ON DELETE clauses on the dataset_consumers table in those layers.
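For example (assuming something like option B’s nullable column), the two layers could declare the same foreign key with different behavior. Sketch only: the table-name prefixes are just to show both in one place; in reality these would be identically-named tables in different layers.

```sql
-- Base layer: refuse to delete a dataset that is still referenced as an input.
CREATE TABLE base_layer_dataset_consumers (
    quantum_id  INTEGER NOT NULL,
    dataset_id  INTEGER,
    FOREIGN KEY (dataset_id) REFERENCES dataset (dataset_id) ON DELETE RESTRICT
);

-- Per-user layer: allow the deletion and null out the consumer link.
CREATE TABLE user_layer_dataset_consumers (
    quantum_id  INTEGER NOT NULL,
    dataset_id  INTEGER,
    FOREIGN KEY (dataset_id) REFERENCES dataset (dataset_id) ON DELETE SET NULL
);
```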