This is a very technical post intended primarily for Gen3 middleware insiders. If you have an opinion but lack the background for all of it to make sense, please ask for clarification here or in #dm-middleware on Slack.
I’d like to get some feedback on how to handle in-database provenance (quanta) when deleting datasets. I’m talking about full deletion here - it’s safer and easier by far to just delete datastore artifacts (e.g. files) and/or disassociate from tagged collections, and none of the below applies to those cases, which don’t affect provenance at all.
As a bit of background, the DAG of dataset provenance mirrors the one used to produce the dataset via `PipelineTask`s; there are three tables that are relevant:
- The `dataset` table has its own `dataset_id` primary key, and a `quantum_id` foreign key to the `quantum` table; this is the quantum used to produce the dataset (it also has other fields that aren’t relevant for this discussion).
- The `quantum` table has its own `quantum_id` primary key (and other fields not relevant for this discussion).
- The `dataset_consumers` table has both `quantum_id` and `dataset_id` foreign keys, and describes the datasets that were used as inputs to that quantum. Those fields together are used as a compound primary key. (There is also an `actual` bool field that can be used to describe whether the input was merely predicted or actually used, but it’s not relevant for this discussion.)
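To make the relationships concrete, here’s a minimal SQLAlchemy sketch of the three tables as I’ve described them. The column types, the `CASCADE` on `dataset_consumers.quantum_id`, and everything omitted are simplifying assumptions for illustration, not the real Gen3 schema:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# One row per execution of a PipelineTask (other fields omitted).
quantum = sa.Table(
    "quantum",
    metadata,
    sa.Column("quantum_id", sa.Integer, primary_key=True, autoincrement=True),
)

# Each dataset records the quantum that produced it; deleting that
# quantum nulls the link rather than deleting the dataset.
dataset = sa.Table(
    "dataset",
    metadata,
    sa.Column("dataset_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column(
        "quantum_id",
        sa.Integer,
        sa.ForeignKey("quantum.quantum_id", ondelete="SET NULL"),
        nullable=True,
    ),
)

# Join table: the datasets each quantum consumed as inputs.
dataset_consumers = sa.Table(
    "dataset_consumers",
    metadata,
    sa.Column(
        "quantum_id",
        sa.Integer,
        # Assumption: consumer rows go away with their quantum.
        sa.ForeignKey("quantum.quantum_id", ondelete="CASCADE"),
        primary_key=True,
    ),
    sa.Column(
        "dataset_id",
        sa.Integer,
        sa.ForeignKey("dataset.dataset_id"),
        primary_key=True,
    ),
    # Whether the input was actually used rather than merely predicted.
    sa.Column("actual", sa.Boolean, nullable=False, default=False),
)
```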
When we fully delete a dataset from the `Registry`, we’ve by definition destroyed at least some of the provenance of any datasets that were produced by pipelines that used the deleted dataset as an input. I had up to now assumed that the right way to handle this was to aggressively delete any quanta that used the dataset as an input, because it was better for them to be gone than to be incomplete. This can’t be done fully by `ON DELETE CASCADE`, which is slightly annoying but not a fundamental problem, because the relationship between datasets and the quanta that use them is mediated by `dataset_consumers`. Deleting the quantum just sets `dataset.quantum_id` to `NULL` (via `ON DELETE SET NULL`) for datasets produced by that quantum, which is nice and clean (and easy): further-downstream quanta aren’t destroyed, but we’ve made it clear that we no longer have any information about where those datasets came from.
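In terms of the sketch above, that aggressive approach would look roughly like this; `delete_dataset_fully` is a hypothetical helper, not actual `Registry` code:

```python
import sqlalchemy as sa

def delete_dataset_fully(conn, dataset_id):
    """Fully delete a dataset, aggressively deleting any quanta that
    consumed it (a sketch of the approach, using the tables above)."""
    # Find every quantum that used this dataset as an input.
    consumers = sa.select(dataset_consumers.c.quantum_id).where(
        dataset_consumers.c.dataset_id == dataset_id
    )
    quantum_ids = [row.quantum_id for row in conn.execute(consumers)]
    # Deleting those quanta removes their dataset_consumers rows
    # (CASCADE) and nulls dataset.quantum_id on their outputs
    # (SET NULL), so further-downstream quanta survive.
    if quantum_ids:
        conn.execute(
            sa.delete(quantum).where(quantum.c.quantum_id.in_(quantum_ids))
        )
    # With no consumer rows left pointing at it, the dataset row
    # itself can now go.
    conn.execute(sa.delete(dataset).where(dataset.c.dataset_id == dataset_id))
```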
I think we have a few other options now, and I wanted to see if others have any preferences:
A. We could remove the foreign key constraint on `dataset_consumers.dataset_id`, and allow it to contain a value that isn’t in the `dataset` table at all (because it’s been deleted). The rest of the quantum would stay, unless it was deleted explicitly. This saves us some logic, and it preserves a bit more of the provenance. It does mean code that uses quanta needs to be prepared for the possibility of a dangling reference to a dataset, and to recognize that as an indication of incomplete provenance. Previously this would have been problematic because SQLite can sometimes recycle autoincrement IDs that had been used by now-deleted rows (leading to incorrect links, not just broken ones), but we already need to tell it not to do that (which we can and will shortly do) to avoid the same problem elsewhere.
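Sketched against the tables above (and reusing `dataset` from the first sketch), option A just drops the constraint, and reading code has to detect dangling references itself; the helper name here is made up:

```python
import sqlalchemy as sa

# Option A: dataset_id keeps its place in the compound primary key
# but has no foreign key constraint, so it can dangle after deletion.
dataset_consumers_a = sa.Table(
    "dataset_consumers",
    sa.MetaData(),
    sa.Column("quantum_id", sa.Integer, primary_key=True),
    sa.Column("dataset_id", sa.Integer, primary_key=True),  # no FK
    sa.Column("actual", sa.Boolean, nullable=False, default=False),
)

def has_complete_inputs(conn, quantum_id):
    """Hypothetical check: a recorded input with no matching dataset
    row is a dangling reference, i.e. incomplete provenance."""
    dangling = (
        sa.select(dataset_consumers_a.c.dataset_id)
        .where(dataset_consumers_a.c.quantum_id == quantum_id)
        .where(
            ~sa.exists().where(
                dataset.c.dataset_id == dataset_consumers_a.c.dataset_id
            )
        )
        .limit(1)
    )
    return conn.execute(dangling).first() is None
```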
B. We could allow `dataset_consumers.dataset_id` to be `NULL`, and define its foreign key constraint with `ON DELETE SET NULL`. This is a lot like (A) in behavior, but `NULL` is a much nicer way to represent “there was a dataset here that’s been deleted” than a dangling ID value (and it maps nicely to `None` in Python). Making the field nullable would mean we’d have to remove it from the compound primary key, and either not have one or define a surrogate autoincrement key.
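In the same sketch form, option B’s table would become something like the following; the surrogate key’s name is an invention:

```python
import sqlalchemy as sa

# Option B: dataset_id is nullable with ON DELETE SET NULL; a NULL now
# means "there was an input here that has been deleted", and it maps
# to None in Python. The compound primary key has to go, replaced
# here by a surrogate autoincrement key (hypothetical name).
dataset_consumers_b = sa.Table(
    "dataset_consumers",
    sa.MetaData(),
    sa.Column("consumer_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column(
        "quantum_id",
        sa.Integer,
        sa.ForeignKey("quantum.quantum_id", ondelete="CASCADE"),
        nullable=False,
    ),
    sa.Column(
        "dataset_id",
        sa.Integer,
        sa.ForeignKey("dataset.dataset_id", ondelete="SET NULL"),
        nullable=True,
    ),
    sa.Column("actual", sa.Boolean, nullable=False, default=False),
)
```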
C. Make the foreign key on `dataset_consumers.dataset_id` `ON DELETE CASCADE` (so the whole row disappears when the dataset is deleted), but record that the quantum is incomplete some other way. I’m thinking vaguely of storing a hash/checksum/count of the quantum’s complete set of inputs in the `quantum` row itself, which can then be compared to the set of inputs according to the current (post-deletion) state of `dataset_consumers` to see if the quantum is still complete. That requires code using the provenance to make that check, but I think we can embed that as a flag in the `Quantum` Python class itself and in the `Registry` code that retrieves them (which doesn’t yet exist).
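And a sketch of option C, using a stored input count as the simplest stand-in for whatever hash/checksum/count we’d settle on, with the proposed flag on a stand-in `Quantum` class (this reuses `dataset_consumers` from the first sketch; none of the names here are real middleware API):

```python
import sqlalchemy as sa
from dataclasses import dataclass

# Option C: consumer rows vanish with their datasets (CASCADE), so the
# quantum row remembers how many inputs it originally had (a count is
# the simplest choice; a hash or checksum would work the same way).
quantum_c = sa.Table(
    "quantum",
    sa.MetaData(),
    sa.Column("quantum_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column("input_count", sa.Integer, nullable=False),
)

@dataclass
class Quantum:
    """Stand-in for the real Quantum class, with the proposed flag."""
    quantum_id: int
    is_complete: bool

def fetch_quantum(conn, quantum_id):
    """Hypothetical retrieval code: compare the stored input count
    against the surviving dataset_consumers rows."""
    stored = conn.execute(
        sa.select(quantum_c.c.input_count).where(
            quantum_c.c.quantum_id == quantum_id
        )
    ).scalar_one()
    surviving = conn.execute(
        sa.select(sa.func.count())
        .select_from(dataset_consumers)
        .where(dataset_consumers.c.quantum_id == quantum_id)
    ).scalar_one()
    return Quantum(quantum_id=quantum_id, is_complete=(surviving == stored))
```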