This is a very technical post intended primarily for Gen3 middleware insiders. If you have an opinion but lack the background for all of it to make sense, please ask for clarification here or in #dm-middleware on Slack.
I’d like to get some feedback on how to handle in-database provenance (quanta) when deleting datasets. I’m talking about full deletion here - it’s safer and easier by far to just delete datastore artifacts (e.g. files) and/or disassociate from tagged collections, and none of the below applies to those cases, which don’t affect provenance at all.
As a bit of background, the DAG of dataset provenance mirrors the one used to produce the dataset via `PipelineTask`s; there are three tables that are relevant:
- The `dataset` table has its own `dataset_id` primary key, and a `quantum_id` foreign key to the `quantum` table; this is the quantum used to produce the dataset (it also has other fields that aren’t relevant for this discussion).
- The `quantum` table has its own `quantum_id` primary key (and other fields not relevant for this discussion).
- The `dataset_consumers` table has both `quantum_id` and `dataset_id` foreign keys, and describes the datasets that were used as inputs to that quantum. Those fields together are used as a compound primary key. (There is also an `actual` bool field that can be used to describe whether the input was merely predicted or actually used, but it’s not relevant for this discussion.)
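To make the relationships concrete, here’s a minimal SQLAlchemy sketch of the three tables as I’ve described them. The column types, the `CASCADE` on `dataset_consumers.quantum_id`, and everything omitted are simplifying assumptions for illustration, not the real Gen3 schema:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# One row per execution of a PipelineTask (other fields omitted).
quantum = sa.Table(
    "quantum",
    metadata,
    sa.Column("quantum_id", sa.Integer, primary_key=True, autoincrement=True),
)

# Each dataset records the quantum that produced it; deleting that
# quantum nulls the link rather than deleting the dataset.
dataset = sa.Table(
    "dataset",
    metadata,
    sa.Column("dataset_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column(
        "quantum_id",
        sa.Integer,
        sa.ForeignKey("quantum.quantum_id", ondelete="SET NULL"),
        nullable=True,
    ),
)

# Join table: the datasets each quantum consumed as inputs.
dataset_consumers = sa.Table(
    "dataset_consumers",
    metadata,
    sa.Column(
        "quantum_id",
        sa.Integer,
        # Assumption: consumer rows go away with their quantum.
        sa.ForeignKey("quantum.quantum_id", ondelete="CASCADE"),
        primary_key=True,
    ),
    sa.Column(
        "dataset_id",
        sa.Integer,
        sa.ForeignKey("dataset.dataset_id"),
        primary_key=True,
    ),
    # Whether the input was actually used rather than merely predicted.
    sa.Column("actual", sa.Boolean, nullable=False, default=False),
)
```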
When we fully delete a dataset from the `Registry`, we’ve by definition destroyed at least some of the provenance of any datasets that were produced by pipelines that used the deleted dataset as an input. I had up to now assumed that the right way to handle this was to aggressively delete any quanta that used the dataset as an input, because it was better for them to be gone than to be incomplete. This can’t be done fully by `ON DELETE CASCADE`, which is slightly annoying but not a fundamental problem, because the relationship between datasets and the quanta that use them is mediated by `dataset_consumers`. Deleting the quantum just sets `dataset.quantum_id` to `NULL` (via `ON DELETE SET NULL`) for datasets produced by that quantum, which is nice and clean (and easy): further-downstream quanta aren’t destroyed, but we’ve made it clear that we no longer have any information about where those datasets came from.
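In terms of the sketch above, that aggressive approach would look roughly like this; `delete_dataset_fully` is a hypothetical helper, not actual `Registry` code:

```python
import sqlalchemy as sa

def delete_dataset_fully(conn, dataset_id):
    """Fully delete a dataset, aggressively deleting any quanta that
    consumed it (a sketch of the approach, using the tables above)."""
    # Find every quantum that used this dataset as an input.
    consumers = sa.select(dataset_consumers.c.quantum_id).where(
        dataset_consumers.c.dataset_id == dataset_id
    )
    quantum_ids = [row.quantum_id for row in conn.execute(consumers)]
    # Deleting those quanta removes their dataset_consumers rows
    # (CASCADE) and nulls dataset.quantum_id on their outputs
    # (SET NULL), so further-downstream quanta survive.
    if quantum_ids:
        conn.execute(
            sa.delete(quantum).where(quantum.c.quantum_id.in_(quantum_ids))
        )
    # With no consumer rows left pointing at it, the dataset row
    # itself can now go.
    conn.execute(sa.delete(dataset).where(dataset.c.dataset_id == dataset_id))
```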
I think we have a few other options now, and I wanted to see if others have any preferences:
A. We could remove the foreign key constraint on `dataset_consumers.dataset_id`, and allow it to contain a value that isn’t in the `dataset` table at all (because it’s been deleted). The rest of the quantum would stay, unless it was deleted explicitly. This saves us some logic, and it preserves a bit more of the provenance. It does mean code that uses quanta needs to be prepared for the possibility of a dangling reference to a dataset, and to recognize that as an indication of incomplete provenance. Previously this would have been problematic because SQLite can sometimes recycle autoincrement IDs that had been used by now-deleted rows (leading to incorrect links, not just broken ones), but we already need to tell it not to do that (which we can and will shortly do) to avoid the same problem elsewhere.
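Sketched against the tables above (and reusing `dataset` from the first sketch), option A just drops the constraint, and reading code has to detect dangling references itself; the helper name here is made up:

```python
import sqlalchemy as sa

# Option A: dataset_id keeps its place in the compound primary key
# but has no foreign key constraint, so it can dangle after deletion.
dataset_consumers_a = sa.Table(
    "dataset_consumers",
    sa.MetaData(),
    sa.Column("quantum_id", sa.Integer, primary_key=True),
    sa.Column("dataset_id", sa.Integer, primary_key=True),  # no FK
    sa.Column("actual", sa.Boolean, nullable=False, default=False),
)

def has_complete_inputs(conn, quantum_id):
    """Hypothetical check: a recorded input with no matching dataset
    row is a dangling reference, i.e. incomplete provenance."""
    dangling = (
        sa.select(dataset_consumers_a.c.dataset_id)
        .where(dataset_consumers_a.c.quantum_id == quantum_id)
        .where(
            ~sa.exists().where(
                dataset.c.dataset_id == dataset_consumers_a.c.dataset_id
            )
        )
        .limit(1)
    )
    return conn.execute(dangling).first() is None
```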
B. We could allow `dataset_consumers.dataset_id` to be `NULL`, and define its foreign key constraint with `ON DELETE SET NULL`. This is a lot like (A) in behavior, but `NULL` is a much nicer way to represent “there was a dataset here that’s been deleted” than a dangling ID value (and it maps nicely to `None` in Python). Making the field nullable would mean we’d have to remove it from the compound primary key, and either not have one or define a surrogate autoincrement key.
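In the same sketch form, option B’s table would become something like the following; the surrogate key’s name is an invention:

```python
import sqlalchemy as sa

# Option B: dataset_id is nullable with ON DELETE SET NULL; a NULL now
# means "there was an input here that has been deleted", and it maps
# to None in Python. The compound primary key has to go, replaced
# here by a surrogate autoincrement key (hypothetical name).
dataset_consumers_b = sa.Table(
    "dataset_consumers",
    sa.MetaData(),
    sa.Column("consumer_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column(
        "quantum_id",
        sa.Integer,
        sa.ForeignKey("quantum.quantum_id", ondelete="CASCADE"),
        nullable=False,
    ),
    sa.Column(
        "dataset_id",
        sa.Integer,
        sa.ForeignKey("dataset.dataset_id", ondelete="SET NULL"),
        nullable=True,
    ),
    sa.Column("actual", sa.Boolean, nullable=False, default=False),
)
```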
C. Make the foreign key on `dataset_consumers.dataset_id` `ON DELETE CASCADE` (so the whole row disappears when the dataset is deleted), but record that the quantum is incomplete some other way. I’m thinking vaguely of storing a hash/checksum/count of the quantum’s complete set of inputs in the `quantum` row itself, which can then be compared to the set of inputs according to the current (post-deletion) state of `dataset_consumers` to see if the quantum is still complete. That requires code using the provenance to make that check, but I think we can embed that as a flag in the `Quantum` Python class itself and in the `Registry` code that retrieves them (which doesn’t yet exist).
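And a sketch of option C, using a stored input count as the simplest stand-in for whatever hash/checksum/count we’d settle on, with the proposed flag on a stand-in `Quantum` class (this reuses `dataset_consumers` from the first sketch; none of the names here are real middleware API):

```python
import sqlalchemy as sa
from dataclasses import dataclass

# Option C: consumer rows vanish with their datasets (CASCADE), so the
# quantum row remembers how many inputs it originally had (a count is
# the simplest choice; a hash or checksum would work the same way).
quantum_c = sa.Table(
    "quantum",
    sa.MetaData(),
    sa.Column("quantum_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column("input_count", sa.Integer, nullable=False),
)

@dataclass
class Quantum:
    """Stand-in for the real Quantum class, with the proposed flag."""
    quantum_id: int
    is_complete: bool

def fetch_quantum(conn, quantum_id):
    """Hypothetical retrieval code: compare the stored input count
    against the surviving dataset_consumers rows."""
    stored = conn.execute(
        sa.select(quantum_c.c.input_count).where(
            quantum_c.c.quantum_id == quantum_id
        )
    ).scalar_one()
    surviving = conn.execute(
        sa.select(sa.func.count())
        .select_from(dataset_consumers)
        .where(dataset_consumers.c.quantum_id == quantum_id)
    ).scalar_one()
    return Quantum(quantum_id=quantum_id, is_complete=(surviving == stored))
```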