Generation 3 Butler notes using S3¶

Creating a generation 3 butler repository ¶

• Step 1 : Creating registry file (reg.yaml file)

• Firstly, we need to create an SQLite file for the registry (e.g. vi test.sqlite3). (PostgreSQL is often used for S3.)

• Then, we have to create an S3 bucket on echo which will be the butler repository. I used Rclone (https://rclone.org/docs/) to do this "rclone mkdir remote:bucket_name"

• Now that we have all of this, create a new file called reg.yaml. Within that file, set the path to the SQLite file. Example:

• registry:
  db: sqlite:////home/test.sqlite3 
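Since PostgreSQL is often used for S3-backed repositories, the db entry can instead be a PostgreSQL connection URL. A hypothetical example (the host, port, user and database name below are placeholders, not values from this setup):

```yaml
registry:
  db: postgresql://username@db_host:5432/database_name
```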
• Step 2 : Configuring the butler repository

• Now that we have the reg.yaml file, we can create an empty Gen3 Butler repository.

• We do this by using "butler create" which is a command line task (https://pipelines.lsst.io/modules/lsst.daf.butler/scripts/butler.py.html)

• We run:

• » butler create s3://bucket_name --seed-config reg.yaml --override

• where s3://bucket_name is the REPO which is the URI or path to the new repository

• Now we have created a generation 3 Butler repository. If we check our S3 bucket, we will see that butler.yaml is now in the repository. We can also check the SQL file and see all the tables that were created, which are used to query the datasets.
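When the registry is a local SQLite file (as in the reg.yaml example above), a quick way to check the tables is Python's built-in sqlite3 module; the path is the example file from Step 1:

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# For the registry created above, point this at the seed file, e.g.:
# list_tables("/home/test.sqlite3")
```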

Adding an instrument to the GEN 3 butler repository ¶

• Step 1: Find the instrument class

• Step 2: Running the "register-instrument" Command

• » butler register-instrument s3://bucket_name lsst.obs.subaru.HyperSuprimeCam

• where s3://bucket_name is the REPO which is the URI or path to the new repository and lsst.obs.subaru.HyperSuprimeCam is the instrument class

• Note that for butler subcommands, once an instrument is registered you can refer to that instrument by its short name
(e.g. » butler write-curated-calibrations s3://bucket_name HSC)

Ingesting raw frames from a directory into the butler registry ¶

• Step 1: Adding an instrument to the GEN 3 butler repository

• Make sure that an instrument has been added to the GEN 3 butler repository (see above for instructions on how to add an instrument to the butler repository)

• Step 2: Running the "ingest-raws" Command

• » butler ingest-raws s3://bucket_name /home/lsst_stack/testdata_ci_hsc/raw

• where s3://bucket_name is the REPO which is the URI or path to the new repository, and /home/lsst_stack/testdata_ci_hsc/raw is LOCATIONS, which specifies files to ingest and/or directories to search for files.

Defining the visits system in a butler repository ¶

• Generation 2
• In gen2, visit was effectively the fundamental concept of an observation and all the instruments treated visit as being a synonym for exposure or observation.

• Generation 3

• In gen3 we separate the two concepts so “exposure” means one self-contained observation that could be processed on its own, and “visit” is effectively a scheme for grouping exposures.

• In gen3 we currently have two “visit systems” in place, and they are set up after the raw data have been ingested. The two are either “one visit == one exposure”, or else visits are created using the “group name” – we put a groupId header in each file so we can tell whether the observing script has decided that two exposures should be processed together. We haven’t really fleshed out this process because we’ve never really taken any multi-exposure visit data (LSSTCam is required to support two-exposure or one-exposure visits). All the current pipelines assume a visit is a single exposure, and there is a step in the pipeline that morphs the exposure into a visit. All the pipelines require that visits are defined, so after ingest, if you want to use our standard pipelines, you will need to run butler define-visits.
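The “group name” visit system described above can be illustrated with a small stand-alone sketch. This is not the Butler implementation, just the grouping idea: exposures that share a groupId header form one visit.

```python
from collections import defaultdict


def group_exposures_by_group_id(exposures):
    """Illustrative only: group exposure records into visits by their
    groupId header, mirroring the gen3 "group name" visit system."""
    visits = defaultdict(list)
    for exp in exposures:
        visits[exp["groupId"]].append(exp["id"])
    return dict(visits)


exposures = [
    {"id": 1, "groupId": "G1"},
    {"id": 2, "groupId": "G1"},  # two exposures observed as one visit
    {"id": 3, "groupId": "G2"},  # single-exposure visit
]
# group_exposures_by_group_id(exposures) -> {"G1": [1, 2], "G2": [3]}
```

Under the “one visit == one exposure” system, every exposure would simply land in its own group.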

• Step 1: Make sure the instrument has been registered (see above)

• Step 2: Running the "define-visits" Command

• » butler define-visits s3://bucket_name HSC

• where s3://bucket_name is the REPO which is the URI or path to the new repository, and HSC is the instrument's short name

Convert a Butler gen 2 repository into a gen 3 repository ¶

• Step 1: Set up a Gen 3 butler repository (see above for instructions)

• Step 2: Running the "convert" Command

• » butler convert s3://bucket_name --gen2root /home/lsst_stack/DATA --reruns <rerun_path> --calibs <calib_path> --processes <N>

• where s3://bucket_name is the REPO which is the URI or path to the new repository, --gen2root is the root path of the gen 2 repo to be converted (/home/lsst_stack/DATA is the path to the gen 2 repo), --processes sets the number of processing cores used in the conversion, --reruns is the path to the rerun directories, and --calibs is the path to the calib directory.

• The tutorial for creating a gen 2 repository is here: https://pipelines.lsst.io/getting-started/data-setup.html

Importing data across two GEN 3 repositories ¶

• Step 1: Export the data

• Firstly, you will have to export the data from the repository that currently contains it (how to export data).
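A minimal sketch of the export step using the Butler Python API. This requires the LSST Science Pipelines, and the dataset type ("raw") and collection name ("HSC/raw/all") below are assumptions for illustration, not prescribed values:

```python
# Sketch only: requires the LSST Science Pipelines (lsst.daf.butler).
from lsst.daf.butler import Butler

butler = Butler("s3://bucket_name")

# Write an exports.yaml describing the chosen datasets; this file can
# then be passed to "butler import" via --export-file.
with butler.export(filename="exports.yaml") as export:
    export.saveDatasets(
        butler.registry.queryDatasets("raw", collections="HSC/raw/all")
    )
```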

• Step 2: Running the "import" Command

• » butler import s3://bucket_name_new s3://bucket_name --export-file exports.yaml

• where s3://bucket_name is the REPO which is the URI or path to the repository with the data, and s3://bucket_name_new is the REPO which is the URI or path to the repository where you want to put your data.

Butler using a jupyter notebook ¶

Accessing the data registry ¶

The registry is a good tool for investigating a repo (more on the registry schema can be found here). For example, we can get a list of all collections, which includes the HSC/raw/all collection that we were using before.

Now that we "know" that HSC/raw/all exists, let's create our butler with this collection:

We can also use the registry to get a list of all dataset types

We suspect that this lists all the datasetTypes that the processing has tried to create. There may be intermediate products that were created during processing but no longer exist.

It is now possible to get all DatasetRef (including dataId) for a specific datasetType in a specific collection with a query like the one that follows

Ok: now we know what collections exist (HSC/raw/all in particular), the datasetTypes that are defined for that collection, and the datasetRefs (which contain dataIds) for data products of the requested type. This is all the information that we need to get the dataset of interest.

From the list above, I choose index 16, and with this we will find the dataId.

DatasetRef is a combination of dataset type and dataId and can refer to an explicit dataset in a specific run (if ref.dataId is defined)
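The registry walkthrough above, whose notebook cells are not reproduced here, roughly corresponds to calls like the following. This is a sketch requiring the LSST Science Pipelines, not verbatim notebook output; the repo URI, dataset type and collection names are the examples used earlier in these notes:

```python
# Sketch only: requires the LSST Science Pipelines (lsst.daf.butler).
from lsst.daf.butler import Butler

butler = Butler("s3://bucket_name")
registry = butler.registry

# List all collections (HSC/raw/all should appear)
print(list(registry.queryCollections()))

# Create a butler bound to that collection
butler = Butler("s3://bucket_name", collections="HSC/raw/all")

# List all dataset types
print(list(registry.queryDatasetTypes()))

# All DatasetRefs (including dataIds) for a dataset type in a collection
refs = list(registry.queryDatasets("raw", collections="HSC/raw/all"))
ref = refs[16]          # e.g. index 16, as chosen above
print(ref.dataId)

# Retrieve the dataset itself
raw = butler.get(ref)
```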

Plotting ¶

How to create a table using the data