How is input and output data laid out in USDF?

nsevilla · February 24, 2025, 5:35pm

Hello. A noob question here on where the data lives in USDF. I have some familiarity with accessing the data through the butler, but would like to know, for instance for recent HSC-PDR2 or OR4 runs, how the data is actually laid out in the directory structure of S3DF… assuming it works that way. The goal is to understand how to plan data transfers to NERSC/IDACs upon data release from the USDF. butler query-collections seems to point me to some directories, but it is not always clear to me from the output where the raw data and processed data are. Thanks.

price · February 24, 2025, 7:41pm

The layout is an implementation detail, subject to change. You should use the butler to access the data. This is especially true when the implementation uses an object store like S3.

nsevilla · February 24, 2025, 8:05pm

I see. I guess for Rucio + FTS there will be a way for the transfer to communicate directly with the object storage then. I was thinking of a non-Rucio global transfer solution, for DP1 or OR4 for instance. I guess that would mean staging the data on disk somehow first. Thanks!

timj · February 24, 2025, 8:33pm

We configure Rucio to preserve the butler directory structure so that when you receive the files they will be in the same place in the tree. At USDF they are in an S3 bucket, and you can find out which bucket by using the butler to get the URI of any dataset. We want Rucio to retain the directory hierarchy so that you can also receive a copy of the butler database without having to patch it.
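
As a rough illustration of the point above (a sketch, not how Rucio or the butler actually implement it): once you have the URI of any one dataset, you can split it into bucket and object key, and a mirror that keeps the key unchanged under a local root preserves the hierarchy. The helper names, the example bucket, and the example paths below are all made up for illustration. The commented-out butler calls are an assumption about the `lsst.daf.butler` Python API and should be checked against the stack version you have installed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed way to obtain a URI with the LSST stack (not run here; verify
# the calls against your installed version of lsst.daf.butler):
#
#   from lsst.daf.butler import Butler
#   butler = Butler("<repo>")
#   ref = butler.query_datasets("raw", collections="<collection>", limit=1)[0]
#   uri = str(butler.getURI(ref))


def split_object_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    # Drop anything before an '@' in the authority part, in case the URI
    # carries a prefix ahead of the bucket name (an assumption).
    bucket = parts.netloc.rsplit("@", 1)[-1]
    key = parts.path.lstrip("/")
    return bucket, key


def mirrored_path(uri: str, local_root: str) -> str:
    """Local destination that keeps the object key's hierarchy intact."""
    _, key = split_object_uri(uri)
    return str(PurePosixPath(local_root) / key)


if __name__ == "__main__":
    # Made-up URI; real bucket names and paths will differ.
    example = "s3://example-bucket/repo/raw/20250224/file.fits"
    print(split_object_uri(example))
    print(mirrored_path(example, "/data/mirror"))
```

Note that the key may include a datastore root prefix inside the bucket. It is the path relative to that root that has to stay the same for a copied butler database to keep working without patching.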