Plan for IDACs to clone Data Butler for Data Releases

george_beckett · June 11, 2025, 11:02am

Question from @StephenGwyn (CADC, Canadian IDAC team).

george_beckett · June 11, 2025, 11:05am

Hi Stephen,

In UK IDAC, we’ve experimented with different options for setting up the Data Butler registry (a PostgreSQL database) once the Data Butler repository files have been deposited into suitable storage. This is described in Section 4 of our technical report (LUSC-B-47).

We’re actually ingesting more DP0.2 files just now, so are revisiting these instructions and thinking about how to harden our Data Butler setup. We can provide updates on that, too, if useful.

Hope this helps,
George.

timj · June 11, 2025, 3:19pm

What was the original question? What are CADC planning to do regarding butler access?

george_beckett · June 12, 2025, 8:09am

Simply what is in the title. @StephenGwyn - can you elaborate?

timj · June 12, 2025, 3:42pm

Sorry. There was no question in the title.

I do not know the plans for IDACs getting a copy of the DP1 butler. We do have tooling in place for exporting the registry and loading it into another PostgreSQL database. The files total about 3 TB, so it’s not a lot of data. After the first DP1 release date we can definitely talk to people about butler loading.

george_beckett · June 12, 2025, 4:10pm

Sorry, Tim. The question I copied read “What is the plan for IDACs to clone Data Butler for Data Release?”, which I took to mean: what is the easiest/recommended way for IDACs to register the files that will form part of the Butler repository into a Butler registry, ready for end-user access? This was the question we investigated in the report I linked to earlier in the thread. I think we need more info from Stephen. Thanks, George.

StephenGwyn · June 12, 2025, 7:59pm

@timj Yes, we’re talking about importing the registry.

The files themselves are transferred by Rucio, and will end up in our Storage Inventory system. We’ll need to write the bit that adapts Storage Inventory to Datastore.

george_beckett · June 13, 2025, 8:23am

When looking at this previously, we identified four options for populating the Butler registry (see p. 19):

“To complete the setup of the DP0.2 dataset, we needed to create a Data Butler and ingest/register the files transferred from IN2P3 into it. There are four possible ways to do the registration:
1. By importing a YAML file previously exported from a different Butler, which would define the appropriate location of the data files. This was the method used previously, when importing the VISTA-HSC collections, since we had access to the original Butler and were provided with the export file.
2. By importing a database export of the original Butler registry into the PostgreSQL registry used by the target Butler.
3. By doing a ‘raw’ ingestion: that is, having the Butler search through the directory tree of the repository to create the registry from scratch.
4. By producing an ECSV file in a specific format with information about where each file is located and giving it to the Butler as input.”

We elected to use Option 4 and (if I read correctly) used a script from our colleagues at IN2P3 to create the ECSV file: notebooks/create_table_file.ipynb in the aibsen/dp02_in2p3_to_somerville repository on GitHub.
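As a rough illustration of Option 4, a table of this kind can be written with nothing more than the Python standard library. The sketch below is not the IN2P3 notebook. It assumes the layout described in the `butler ingest-files` help (first column the file URI, remaining columns named after the dimensions of the dataset type); the column name `filename`, the file paths, the `DC2` skymap and the tract/patch values are hypothetical placeholders.

```python
import csv
import tempfile
from pathlib import Path


def ecsv_type(value):
    """Return the ECSV datatype name for a Python value (only int and str are handled)."""
    return "int64" if isinstance(value, int) and not isinstance(value, bool) else "string"


def write_ingest_table(rows, dimensions, out_path):
    """Write rows as a comma-delimited ECSV file: file URI first, then one column per dimension."""
    columns = ["filename", *dimensions]
    header = ["# %ECSV 1.0", "# ---", "# delimiter: ','", "# datatype:"]
    # Column types are inferred from the first row only.
    header += [f"# - {{name: {c}, datatype: {ecsv_type(rows[0][c])}}}" for c in columns]
    with open(out_path, "w", newline="") as handle:
        handle.write("\n".join(header) + "\n")
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(columns)
        writer.writerows([row[c] for c in columns] for row in rows)
    return Path(out_path)


# Hypothetical deepCoadd entries; real paths and data IDs come from the transferred files.
rows = [
    {"filename": "/data/dp02/deepCoadd/coadd_3828_19_i.fits",
     "band": "i", "skymap": "DC2", "tract": 3828, "patch": 19},
    {"filename": "/data/dp02/deepCoadd/coadd_3828_20_i.fits",
     "band": "i", "skymap": "DC2", "tract": 3828, "patch": 20},
]
table_path = write_ingest_table(
    rows, ["band", "skymap", "tract", "patch"], Path(tempfile.mkdtemp()) / "deepCoadd.ecsv"
)
print(table_path.read_text())
```

It is worth checking that `astropy.table.Table.read` accepts the resulting file before handing it to the Butler, since that is the reader the command relies on.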

We’ve recently ingested further image data into the UK IDAC: I’ll ask the team how they registered files on the more recent occasion.

eckhard.sutorius · June 13, 2025, 4:23pm

I’ve used that script as the basis for a Python script that also removes duplicated data (data re-observed because there were problems) and keeps only the latest entry. Ingesting the output ECSV file with ‘butler ingest-files’ then took only a minute or so for the deepCoadds. After that I created a collection chain so that only this data is visible at the top level, and not all the duplicated daily entries as well.
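A minimal sketch of that workflow, under stated assumptions, might look like the following. It is not the actual script: it assumes the rows are already in time order, and the repo path, run name, chain name and file names are hypothetical. The two helper functions only build the command lines for `butler ingest-files` and `butler collection-chain`; the argument order follows the commands' help text as far as I know it, so check `--help` on your installed version before running anything.

```python
import shlex


def keep_last(rows, key_columns):
    """Keep only the last row seen for each data ID; rows are assumed to be in time order."""
    latest = {}
    for row in rows:
        # A later row with the same data ID overwrites the earlier one.
        latest[tuple(row[c] for c in key_columns)] = row
    return list(latest.values())


def ingest_command(repo, dataset_type, run, table_file, transfer="direct"):
    """Build (but do not run) a 'butler ingest-files' command line."""
    return ["butler", "ingest-files", "--transfer", transfer, repo, dataset_type, run, table_file]


def chain_command(repo, parent, children):
    """Build (but do not run) a 'butler collection-chain' command line."""
    return ["butler", "collection-chain", repo, parent, *children]


# Hypothetical rows: patch 19 was produced twice, on day1 and again on day2.
rows = [
    {"filename": "day1/coadd_19.fits", "tract": 3828, "patch": 19, "band": "i"},
    {"filename": "day1/coadd_20.fits", "tract": 3828, "patch": 20, "band": "i"},
    {"filename": "day2/coadd_19.fits", "tract": 3828, "patch": 19, "band": "i"},
]
unique = keep_last(rows, ["tract", "patch", "band"])
print([row["filename"] for row in unique])

# Placeholder repo, run and chain names, for illustration only.
ingest = ingest_command("/repo/dp02", "deepCoadd", "ingest/deepCoadd", "deepCoadd.ecsv")
chain = chain_command("/repo/dp02", "top/defaults", ["ingest/deepCoadd"])
print(shlex.join(ingest))
print(shlex.join(chain))
```

The lists can be passed to `subprocess.run(..., check=True)` once the names have been replaced with real ones.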

timj · June 13, 2025, 4:23pm

We have an export YAML file that we use to populate the DP1 repo on Google (exported from the source repo at SLAC), so that is the form we would expect you to use for importing at CADC.
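For orientation, loading such an export file through the Python API could be sketched as below. This assumes a direct-access butler with write access to the registry and a repository that already exists; the helper names `load_export` and `main`, the argument values and the `direct` transfer mode are illustrative assumptions, and the right transfer mode depends on how the files are laid out relative to the datastore root.

```python
try:
    # Only importable inside an LSST Science Pipelines environment.
    from lsst.daf.butler import Butler
except ImportError:
    Butler = None


def load_export(butler, export_file, data_root, transfer="direct"):
    """Import the records described by an export file into an existing repo.

    `butler` is expected to be a writeable lsst.daf.butler.Butler;
    `data_root` is the directory that file paths in the export file are relative to.
    The default transfer mode is an assumption, not a recommendation.
    """
    butler.import_(directory=data_root, filename=export_file, transfer=transfer)


def main(repo, export_file, data_root):
    """Open the repo for writing and load the export file (hypothetical entry point)."""
    if Butler is None:
        raise RuntimeError("lsst.daf.butler is not importable in this environment")
    load_export(Butler(repo, writeable=True), export_file, data_root)
```

A call such as `main("/repo/dp1", "export.yaml", "/data/dp1")` would then perform the import, with all three values being placeholders.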

What we don’t know is whether you are running a direct access butler or a client/server butler. We also don’t know the form of the URI you are using for your datastore root and whether you are expecting to be generating signed URLs.