See the bottom of this post for a glossary that may help Gen3 non-experts understand it.
Right now the Gen3 butler has two high-level methods,
pruneDatasets, which try to cover all operations that look or smell like dataset deletion, including all of those handled by the also-public
disassociate methods of
Butler methods (and their command-line counterparts) have been tough to maintain, test, and use, and I think it’s fundamentally because they try to do too much: we’ve ended up trying to support many operations that no one may ever use, just because they’re the logical combination of various options/arguments we need for other reasons (e.g. unstoring the datasets in a
I also think many of those operations need to be removed now to avoid complicating our ownership model in future data repositories with a real concept of user or group ownership of datasets; if one can modify a dataset via a reference to it from some non-RUN collection, we’ll need many more kinds of ACLs.
In addition, right now we have one particularly important pain point, captured on DM-28857: it’s currently hard to delete the collection structure produced by pipetask (and as of DM-28960, BPS), which involves a CHAINED collection that references both output RUN collections and input collections of many types. One can’t delete the RUN collections first, because that trips a foreign key violation as long as they are referenced by the CHAINED collection, and if one deletes the CHAINED collection first, the easiest way to find those RUN collections also goes away (but note that one doesn’t want to delete the input collections, and butler has no way to tell the difference using the CHAINED collection, so it’s not that easy).
Finally, these methods are designed to encourage only unstoring datasets (while leaving their Registry description), to preserve provenance, but this is premature and annoying to users: they want to fully delete things, because there isn’t actually any provenance to preserve, and I think we need to provide a better way to “hide” collections before we make it too hard to fully delete them. That seems doable via an extra flag column in the collections table, but only with a schema change. Since adding provenance will also require a schema change, we can do both at the same time (later).
The near-term proposal:
We add a new method, Butler.removeRuns, which fully removes one or more RUN-type collections and all of the datasets within them (I’ve started this on DM-29106).
We remove the
Butler.removeRuns as the recommended way to delete RUN-type collections and Registry.removeCollections as the way to remove all other kinds of collections (which would no longer involve any kind of dataset deletion, because the references to datasets from those collections don’t imply any ownership that should allow one to do that).
We also remove the Butler.pruneDatasets method, leaving us with no high-level way (for now) to fully delete individual datasets. I don’t think we have a use case for this right now, and I’d like to give us a chance to think about the future ownership model and actual use cases before reintroducing something like it (and I expect it will be replaced by multiple simpler methods for different kinds of deletion, as I am proposing we do now for collections).
We change the deletion logic for collections to allow child collections to be deleted while they are referenced by CHAINED collections, by first replacing them there with a special sentinel “[deleted]” collection, which can be used as a way to notify the owner of the CHAINED collection that this occurred.
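The sentinel idea in the last item can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: ToyRegistry, remove_strict, remove_with_sentinel, and the collection names are all hypothetical and are not part of the butler API; the real change would happen in the Registry's database layer.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum, auto

# Hypothetical sentinel name, taken from the "[deleted]" wording in the proposal.
DELETED = "[deleted]"


class CollectionType(Enum):
    RUN = auto()
    TAGGED = auto()
    CHAINED = auto()


@dataclass
class ToyRegistry:
    """In-memory stand-in for the collections tables (not the real Registry)."""

    types: dict[str, CollectionType] = field(default_factory=dict)
    chains: dict[str, list[str]] = field(default_factory=dict)

    def register(self, name: str, type_: CollectionType, children: tuple[str, ...] = ()) -> None:
        self.types[name] = type_
        if type_ is CollectionType.CHAINED:
            self.chains[name] = list(children)

    def remove_strict(self, name: str) -> None:
        """Mimic the current behavior: refuse while any chain references `name`."""
        parents = [p for p, kids in self.chains.items() if name in kids]
        if parents:
            raise RuntimeError(f"{name!r} is still referenced by {parents}")
        self._drop(name)

    def remove_with_sentinel(self, name: str) -> None:
        """Proposed behavior: swap `name` for the sentinel in every chain, then delete."""
        for kids in self.chains.values():
            kids[:] = [DELETED if kid == name else kid for kid in kids]
        self._drop(name)

    def _drop(self, name: str) -> None:
        del self.types[name]
        self.chains.pop(name, None)


# Illustrative (made-up) collection names.
reg = ToyRegistry()
reg.register("u/jane/out/run1", CollectionType.RUN)
reg.register("HSC/defaults", CollectionType.TAGGED)
reg.register("u/jane/out", CollectionType.CHAINED, ("u/jane/out/run1", "HSC/defaults"))

reg.remove_with_sentinel("u/jane/out/run1")
print(reg.chains["u/jane/out"])
```

In this model the chain keeps its length and ordering after the deletion, so whoever owns the CHAINED collection can see exactly which slot was vacated.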
On the command-line side of things,
butler prune-datasets would go away;
butler remove-collections would be added (the former would delete RUN collections and always delete datasets; the latter would delete non-RUN collections and never delete datasets).
we add a pipetask purge command, which deletes all output-RUN collections and the output CHAINED collection matching the usual pattern;
we add a pipetask cleanup command, which deletes all output-RUN collections that are not referenced by the named CHAINED collection but do match its name pattern (i.e. those left behind by
The last two options belong on
butler, because it’s
pipetask that defines the naming convention they rely upon to know what to delete. They would not necessarily be able to work on processing runs where
--output-run was used to customize the
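
The name-based selection that the proposed purge and cleanup commands would rely on can be sketched roughly as follows. Everything here is an assumption for illustration: the helper names are hypothetical, and the pattern assumes output RUN collections are named as the CHAINED collection name followed by "/" and a timestamp such as 20210301T120000Z. The sketch only selects RUN names; it does not delete anything or touch a real repository.

```python
import re

# Assumed timestamp suffix format for pipetask output RUN collections.
TIMESTAMP = re.compile(r"\d{8}T\d{6}Z")


def output_runs(chain_name, all_runs):
    """Return the RUN names that look like outputs of the given CHAINED collection."""
    prefix = chain_name + "/"
    return [
        run
        for run in all_runs
        if run.startswith(prefix) and TIMESTAMP.fullmatch(run[len(prefix):])
    ]


def select_for_purge(chain_name, all_runs):
    """purge: every output RUN matching the pattern (the chain itself would go too)."""
    return output_runs(chain_name, all_runs)


def select_for_cleanup(chain_name, chain_children, all_runs):
    """cleanup: matching output RUNs that the chain no longer references."""
    referenced = set(chain_children)
    return [run for run in output_runs(chain_name, all_runs) if run not in referenced]


# Illustrative (made-up) repository contents.
runs = [
    "u/jane/out/20210301T120000Z",
    "u/jane/out/20210302T090000Z",
    "u/jane/custom-run",
    "HSC/raw/all",
]
children = ["u/jane/out/20210302T090000Z", "HSC/defaults"]

to_cleanup = select_for_cleanup("u/jane/out", children, runs)
to_purge = select_for_purge("u/jane/out", runs)
print(to_cleanup)
print(to_purge)
```

A RUN with a customized name (like the made-up "u/jane/custom-run" above) is never selected, which is the limitation noted for processing runs that used --output-run.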
unstore: delete files and Datastore records without (necessarily) deleting
Datastore records without deleting files or (necessarily) deleting
RUN: a kind of butler collection that datasets intrinsically belong to
TAGGED: a kind of butler collection that only references datasets
CHAINED: a kind of butler collection that references other collections
disassociate: remove a reference to a dataset from a