See the bottom of this post for a glossary that may help Gen3 non-experts understand it.
The problems:
Right now the Gen3 butler has two high-level methods, pruneCollection and pruneDatasets, which try to cover all operations that look or smell like dataset deletion, including all of those handled by the also-public remove, removeCollection, and disassociate methods of Registry. The Butler methods (and their command-line counterparts) have been tough to maintain, test, and use, and I think it’s fundamentally because they try to do too much: we’ve ended up trying to support many operations that no one may ever use, just because they’re the logical combination of various options/arguments we need for other reasons (e.g. unstoring the datasets in a TAGGED collection).
I also think many of those operations need to be removed now to avoid complicating our ownership model in future data repositories with a real concept of user or group ownership of datasets; if one can modify a dataset via a reference to it from some non-RUN collection, we’ll need many more kinds of ACLs.
In addition, right now we have one particularly important pain point, captured on DM-28857: it’s currently hard to delete the collection structure produced by pipetask (and as of DM-28960, BPS), which involves a CHAINED collection that references both output RUN collections and input collections of many types. One can’t delete the RUN collections first, because that trips a foreign key violation as long as they are referenced by the CHAINED collection, and if one deletes the CHAINED collection first, the easiest way to find those RUN collections also goes away (but note that one doesn’t want to delete the input collections, and butler has no way to tell the difference using the CHAINED collection, so it’s not that easy).
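The ordering problem can be reproduced in miniature. The sketch below uses a hypothetical two-table schema in SQLite (the table names, column names, and collection names are illustrative assumptions, not the actual Registry schema) to show why neither deletion order is convenient:

```python
import sqlite3

# Hypothetical, simplified schema: NOT the real Registry tables.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE collection (
    name TEXT PRIMARY KEY,
    type TEXT NOT NULL
);
CREATE TABLE collection_chain (
    parent TEXT NOT NULL REFERENCES collection (name) ON DELETE CASCADE,
    child TEXT NOT NULL REFERENCES collection (name),
    position INTEGER NOT NULL,
    PRIMARY KEY (parent, position)
);
""")
conn.executemany(
    "INSERT INTO collection VALUES (?, ?)",
    [
        ("u/someone/out", "CHAINED"),
        ("u/someone/out/run1", "RUN"),  # output RUN we want to delete
        ("HSC/raw/all", "RUN"),         # input RUN we must keep
    ],
)
conn.executemany(
    "INSERT INTO collection_chain VALUES (?, ?, ?)",
    [
        ("u/someone/out", "u/someone/out/run1", 0),
        ("u/someone/out", "HSC/raw/all", 1),
    ],
)
conn.commit()


def try_delete(name):
    """Attempt to delete a collection; return False on a foreign key violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM collection WHERE name = ?", (name,))
        return True
    except sqlite3.IntegrityError:
        return False


def children(chain):
    """Return the ordered children of a CHAINED collection."""
    rows = conn.execute(
        "SELECT child FROM collection_chain WHERE parent = ? ORDER BY position",
        (chain,),
    )
    return [row[0] for row in rows]


# Order 1: RUN first. Blocked while the chain still references it.
run_deleted_first = try_delete("u/someone/out/run1")

# Order 2: CHAINED first. Works, but the list of children goes with it.
children_before = children("u/someone/out")
chain_deleted = try_delete("u/someone/out")
children_after = children("u/someone/out")
run_deleted_after = try_delete("u/someone/out/run1")
```

Note that `children_before` lists the input collection alongside the output run with nothing to distinguish them, which is the second half of the problem described above.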
Finally, these methods are designed to encourage only unstoring datasets (while leaving their Registry description), to preserve provenance, but this is premature and annoying to users: they want to fully delete things, because there isn’t actually any provenance to preserve, and I think we need to provide a better way to “hide” collections before we make it too hard to fully delete them. That seems doable via an extra flag column in the collections table, but only with a schema change. Since adding provenance also will require a schema change, we can do those at the same time (later).
The near-term proposal:
- We add a new method, Butler.removeRuns, which fully removes one or more RUN-type collections and all of the datasets within them (I’ve started this on DM-29106).
- We remove the Butler.pruneCollection method, leaving Butler.removeRuns as the recommended way to delete RUN-type collections and Registry.removeCollection as the way to remove all other kinds of collections (which would no longer involve any kind of dataset deletion, because the references to datasets from those collections don’t imply any ownership that should allow one to do that).
- We also remove the Butler.pruneDatasets method, leaving us with no high-level way (for now) to fully delete individual datasets. I don’t think we have a use case for this right now, and I’d like to give us a chance to think about the future ownership model and actual use cases before reintroducing something like it (and I expect it will be replaced by multiple simpler methods for different kinds of deletion, as I am proposing we do now for collections).
- We change the deletion logic for collections to allow child collections to be deleted while they are referenced by CHAINED collections, by replacing them there first with a special sentinel “[deleted]” collection, which can be used as a way to notify the owner of the CHAINED collection that this occurred.
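To make the first and last items concrete, here is a toy in-memory model of the proposed behavior. It is only a sketch of the idea: `ToyRegistry`, its attributes, and `remove_runs` are hypothetical names for illustration and do not correspond to the real Butler or Registry API.

```python
from __future__ import annotations

from dataclasses import dataclass, field

DELETED = "[deleted]"  # sentinel collection name from the proposal


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for a registry; not the real Registry class."""

    # RUN collection name -> IDs of the datasets it owns
    runs: dict[str, set[str]] = field(default_factory=dict)
    # CHAINED collection name -> ordered list of child collection names
    chains: dict[str, list[str]] = field(default_factory=dict)

    def remove_runs(self, names: list[str]) -> set[str]:
        """Fully remove RUN collections and their datasets.

        Any CHAINED collection that referenced a removed RUN keeps its
        slot, but the slot now points at the sentinel, so the chain's
        owner can see that something was deleted.
        """
        removed: set[str] = set()
        for name in names:
            # Raises KeyError if `name` is not a RUN collection.
            removed |= self.runs.pop(name)
            for children in self.chains.values():
                children[:] = [DELETED if c == name else c for c in children]
        return removed


registry = ToyRegistry(
    runs={
        "u/someone/out/run1": {"d1", "d2"},
        "HSC/raw/all": {"d0"},
    },
    chains={"u/someone/out": ["u/someone/out/run1", "HSC/raw/all"]},
)
removed = registry.remove_runs(["u/someone/out/run1"])
```

After the call, the chain still has two entries, the first being the sentinel, and the input collection and its dataset are untouched.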
On the command-line side of things,
- butler prune-collection and butler prune-datasets would go away;
- butler remove-runs and butler remove-collections would be added (the former would delete RUN collections and always delete datasets; the latter would delete non-RUN collections and never delete datasets);
- we add a pipetask purge command, which deletes all output-RUN collections and the output CHAINED collection matching the usual pattern;
- we add a pipetask cleanup command, which deletes all output-RUN collections that are not referenced by the named CHAINED collection but do match its name pattern (i.e. those left behind by --replace-run without --prune-replaced).
The last two options belong on pipetask, not butler, because it’s pipetask that defines the naming convention they rely upon to know what to delete. They would not necessarily be able to work on processing runs where --output-run was used to customize the RUN names.
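As a rough illustration of the selection logic a cleanup command would need, the sketch below assumes the output RUN names take the form of the CHAINED collection name followed by a slash and a UTC timestamp such as 20210301T101500Z. That pattern, the function name `find_orphaned_runs`, and the example collection names are all assumptions made for this example.

```python
import re


def find_orphaned_runs(chain_name, chain_children, all_runs):
    """Return RUN names that match the chain's naming pattern but are
    no longer referenced by the chain.

    Assumes (for illustration only) that output RUNs are named
    "<chain_name>/<YYYYMMDD>T<HHMMSS>Z".
    """
    pattern = re.compile(re.escape(chain_name) + r"/\d{8}T\d{6}Z")
    referenced = set(chain_children)
    return sorted(
        run
        for run in all_runs
        if pattern.fullmatch(run) and run not in referenced
    )


all_runs = [
    "u/someone/out/20210301T101500Z",  # replaced earlier, left behind
    "u/someone/out/20210302T093000Z",  # current output run
    "u/someone/out/custom",            # named via --output-run: no match
    "u/someone/other/20210301T101500Z",
    "HSC/raw/all",                     # input collection
]
chain_children = ["u/someone/out/20210302T093000Z", "HSC/raw/all"]

orphans = find_orphaned_runs("u/someone/out", chain_children, all_runs)
```

The custom-named run is skipped because it does not follow the convention, which is the limitation noted above for runs created with --output-run.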
Glossary:
- unstore: delete files and Datastore records without (necessarily) deleting Registry records.
- forget: delete Datastore records without deleting files or (necessarily) deleting Registry records.
- RUN: a kind of butler collection that datasets intrinsically belong to
- TAGGED: a kind of butler collection that only references datasets
- CHAINED: a kind of butler collection that references other collections
- disassociate: remove a reference to a dataset from a TAGGED collection