Completely/automatically deleting portions of a run's outputs with butler prune-datasets

ameisner · October 3, 2022, 7:11am

I would like to completely delete all contents within the postISRCCD and icExp subdirectories of an output run (without resorting to rm -rf, unless that’s recommended…). I am using v23_0_1 of the LSST pipelines.

The following two commands appear to delete all files within the icExp and postISRCCD directories:

butler prune-datasets $REPO --purge DECam/runs/prune_test/20220926T234050Z --datasets icExp DECam/runs/prune_test/20220926T234050Z
butler prune-datasets $REPO --purge DECam/runs/prune_test/20220926T234050Z --datasets postISRCCD DECam/runs/prune_test/20220926T234050Z

But many empty directories are left behind:

$ find repo/DECam/runs/prune_test/20220926T234050Z/postISRCCD -type f | wc -l
0
$ find repo/DECam/runs/prune_test/20220926T234050Z/postISRCCD -type d | wc -l
34
$ find repo/DECam/runs/prune_test/20220926T234050Z/icExp -type f | wc -l
0
$ find repo/DECam/runs/prune_test/20220926T234050Z/icExp -type d | wc -l
50

In this run there are only 25 CCDs’ worth of outputs, so the number of empty directories left behind isn’t huge. But I soon want to process millions of CCDs, in which case it seems that I’d be left with O(10 million) inodes consumed by empty directories within icExp and postISRCCD. What is the recommended way to get rid of these directories, in addition to the files that they once contained?

Also, a related but different question which maybe should be its own separate forum topic: is there a way to embed butler prune-datasets directly into my YAML-defined pipeline? I think that’d be preferable to running my pipeline and then running separate butler prune-datasets commands after the fact.

I did a brief/superficial search for any instances of “prune” within all of the YAML files in our v23_0_1 installation but came up empty:

$ find lsst_stack_v23_0_1/stack -name "*.yaml" | wc -l
4523

$ find lsst_stack_v23_0_1/stack -name "*.yaml" -exec grep -i prune {} \; | wc -l
0

Thanks very much.

timj · October 3, 2022, 3:12pm

Deleting datasets using rm is a bad idea because both registry and datastore will still think the datasets exist.

Leaving the empty directories behind is a deliberate choice at the moment. Deleting files is a safe operation for a system where many processes can be accessing butler simultaneously. Deleting a directory is a lot more problematic because in theory it’s entirely possible that another process trying to write a dataset will crash because of a race condition if it tries to use the directory that just got deleted.

For a timestamp directory this is probably fine but things get more and more dangerous the higher up the directory tree you go. We haven’t really resolved how to deal with this.

If you are in a single-user butler environment and can guarantee that no other butler accesses are ongoing, then after your call to prune-datasets you can clean up the directories.
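
A minimal sketch of that manual cleanup, assuming a single-user repository with nothing else writing to the run, and a find that supports -mindepth, -empty and -delete (GNU and BSD find both do; they are not POSIX). The function name remove_empty_dirs is hypothetical and not part of butler:

```shell
# Hypothetical helper (not a butler command): remove empty directories
# beneath a dataset-type directory after prune-datasets has removed the files.
# Only run this when no other process can be writing to the run.
remove_empty_dirs() {
    target="$1"
    # Refuse to run on a path that is not an existing directory.
    [ -d "$target" ] || { echo "not a directory: $target" >&2; return 1; }
    # -mindepth 1 leaves the top-level directory itself in place;
    # -depth visits children before parents, so nested empty
    #   directories are removed from the bottom up;
    # -type d -empty matches only directories with no entries left.
    find "$target" -mindepth 1 -depth -type d -empty -delete
}

# Example call, using the paths from the question:
# remove_empty_dirs repo/DECam/runs/prune_test/20220926T234050Z/postISRCCD
# remove_empty_dirs repo/DECam/runs/prune_test/20220926T234050Z/icExp
```

Directories that still contain files are left untouched, so this can be run safely after pruning only some of the dataset types in a run.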

As for embedding prune-datasets in a pipeline: no. Pipelines can never delete datasets. See my answer on the other topic. When we run batch processing (with bps) we don’t use the registry at all; we create a throwaway registry and then transfer datasets back on batch completion. In theory you can modify the transfer job to not include the intermediates (having configured a chained datastore where your intermediates go to scratch disk).

ktl · October 3, 2022, 5:10pm

It might be reasonable to have a manual --remove-empty-directories option on prune-datasets for times when you know that nothing else is ever going to write to them.

ameisner · October 3, 2022, 9:02pm

Thanks again, Tim! I really appreciate the responses/answers.

ameisner · October 3, 2022, 9:03pm

Thanks, K-T! This idea sounds good to me as an end user who doesn’t have to worry about broader design principles or other implications…