Generation 3 butler tutorial


Hi all, I have recently started at RAL as a Scientific Computing Graduate in the Tier-1 group, where I’m going to be curating and serving astronomy survey images on the Echo (S3) Object Store. I thought it would be a good idea to create a brief tutorial on the data butler’s command-line tasks and the lsst.daf.butler Python module, both for anyone to use and so that anyone can correct me if I’m using a module incorrectly. Please find the notebook attached below.

Generation_3_Butler_notes_using_S3.html (684.8 KB)

6 Likes

Thanks for this. I have a few questions (noting that I manage the butler development).

  • There are a few ways to configure a butler. In your example you don’t need to specify the datastore at all in your seed configuration because you are colocating the butler.yaml in the S3 bucket. butler create will do the right thing for you and create the butler.yaml to use a datastore relative to the location of the butler.yaml. I think this means you only need to be talking about the registry in that seed file.
  • We have recently changed butler registry to support UUIDs rather than incrementing integers. This means that export and import will preserve the dataset ID and ingest-raws will always result in a repeatable UUID. It’s not enabled by default but is trivial to add to the seed file if that interests you. We are doing it to make it easier to merge registries during workflow execution.
  • I haven’t seen anyone configure a butler like this before so I had not spotted that if you specify a sqlite file explicitly in a path that is outside the butler root then we will not create the sqlite database. I’m not entirely sure if we should be creating it. If you are using an entirely local butler then we do create it automatically. Most people using S3 are using postgres.
  • If you are ingesting files again and again in testing you can create an index file using the astrometadata command and ingest-raws will use that rather than reading the file (this makes it feasible to ingest files from another S3 bucket).
  • Why are you copying the gen2 _mapper file into the gen3 repo? It’s not used by gen3 and gen2 can’t use S3.
  • In your notebook (line 79), the dataset types returned there are the curated calibration dataset types. You use butler write-curated-calibrations to add those, but butler convert will also do it for you.
  • getURIs is the “proper” interface for retrieving URIs to a single dataset because butler supports composite disassembly. You can configure your datastore such that on butler.put() it splits the dataset into its component parts, so for an Exposure it would write the image, variance, mask, wcs etc. into separate files. The motivation for this is that you can then do butler.get("calexp.wcs", ...), and for S3 that will be much more efficient when disassembled since it will only download the small WCS file rather than downloading the entire file just to read a small part of it. In general composite disassembly is not the default, but you can make it so by putting the relevant line in the datastore section of your seed yaml. getURI is there for the simple case and will break for you as soon as disassembly is turned on; raws are never disassembled so that’s always safe. If you have disassembled, the dict returned by getURIs will be filled in with keys like wcs mapping to a URI. getURIs returns the same answer as getURI in its first return value (see the first sketch below this list).
  • It was only yesterday that someone asked for a butler.getURIs(refs,...) API. I need to think about what that would return.
  • Note that you get a ButlerURI back from that which abstracts all the file access (it’s how we can have one datastore implementation working with S3 and local files). It’s not a simple string.
  • If you just want all the raw files copied locally then export is more than you need, because it tries to give you additional data files to allow an import later. You should take a look at butler retrieve-artifacts and the associated Butler API (see the second sketch below this list). For raws you can do the ingest again and get the same answer without the associated metadata from the yaml file, so retrieving and ingesting again should be equivalent to exporting and importing.
  • “auto” for transfer in S3 should always do a “copy”. It uses the default best suited for the URI scheme. We use “direct” for raws because we put raws in a special bucket which gives us the ability to destroy the user bucket and redo things from scratch without having to delete the raws.
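For reference, a minimal sketch of the getURIs / component-get pattern in Python; the repo URI, collection name, and data ID below are hypothetical placeholders, not part of the tutorial:

    from lsst.daf.butler import Butler

    # Hypothetical repo location, collection, and data ID, for illustration only.
    butler = Butler("s3://some-bucket/butler.yaml", collections="HSC/runs/test")
    data_id = {"instrument": "HSC", "visit": 903334, "detector": 16}

    # getURIs copes with composite disassembly: the first return value is the
    # URI for the whole file (None if the dataset was disassembled) and the
    # second is a dict of component URIs (e.g. "wcs" -> URI) when disassembled.
    primary, components = butler.getURIs("calexp", dataId=data_id)

    # Component get: with a disassembled datastore only the small WCS file is
    # downloaded from S3, not the whole exposure.
    wcs = butler.get("calexp.wcs", dataId=data_id)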
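And a rough sketch of the retrieve-artifacts route via the Butler API, assuming the method is Butler.retrieveArtifacts() as in recent daf_butler, with a hypothetical bucket, collection, and destination directory (check the exact signature for your version of the pipelines):

    from lsst.daf.butler import Butler

    # Hypothetical repo and collection, for illustration only.
    butler = Butler("s3://some-bucket/butler.yaml")
    refs = butler.registry.queryDatasets("raw", collections="HSC/raw/all")

    # Copy just the underlying files to a local directory, without the extra
    # metadata files that export would write for a later import.
    butler.retrieveArtifacts(refs, destination="/tmp/raws", transfer="auto")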
3 Likes

Thank you for the feedback. This is much appreciated, especially because you have first-hand experience developing the code.

  • I’m guessing what you mean here is that I’ll only need to create a reg.yaml file where the contents will be:
    registry:
      db: sqlite:////home/vrs42921/kit_test_1.sqlite3
    I tried this and it works, thank you.
  • We plan to use postgresql in the long run. I’m just using sqlite3 to learn how to use the butler commands.
  • I think I misread an error. I’ve run the command again without the _mapper file in the gen 3 directory and it works fine.
  • This was done when I ran the butler convert command. However, the butler convert command didn’t finish running due to memory constraints (4 GB of RAM). The generation 2 repository I used was from the LSST pipeline tutorial. Is it normal for the butler convert command to use that much memory for a 5 GB data butler repository?
  • In our use case we are going to be importing processed data (the rerun directory) into S3, which is why I used raws as an example of exporting and importing, so I’ll know how to do it in the future. Another reason relates to the comment above: since butler convert didn’t finish running, all I had was the raw data to test all the commands with. Is there a better method for transferring the rerun directory between two systems (e.g. HPC at location A to S3 at location B) than exporting and importing the data?

*If I don’t specify the transfer, it gives me an error.

Thank you again for all this feedback. I’ve gone over my notes and corrected all of my mistakes.

Updated version of my notes
Generation_3_Butler_notes_using_S3.html (680.2 KB)

1 Like

Use transfer="auto" and it will mostly always do the right thing. The butler import command line does this. Unfortunately the Butler.import_() API was developed before auto existed and so defaults to None (which means “the files are already in place so no transfer needed”).
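For example, a rough sketch of calling import_ with an explicit transfer mode; the bucket and export directory are hypothetical, assumed to have been produced earlier by Butler.export():

    from lsst.daf.butler import Butler

    # Hypothetical destination repo and export location, for illustration only.
    butler = Butler("s3://some-bucket/butler.yaml", writeable=True)

    # transfer defaults to None ("files already in place"), so pass "auto"
    # to have the files actually copied/uploaded into the datastore.
    butler.import_(directory="/tmp/export",
                   filename="/tmp/export/export.yaml",
                   transfer="auto")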

We recently reduced memory overhead for raw ingest by grouping. Some of this will depend on which version of the pipelines code you are using. I don’t know that we’ve ever done a conversion on a machine with so little memory. Maybe @jbosch has some thoughts on memory usage since he did the big conversions at NCSA.

Are you talking about gen2 reruns at A to gen3 S3 in B? That has to be done through butler convert – we always convert locally into fresh repos because it’s a lot more efficient to do a bulk s3 copy after the conversion. Once you have a gen3 repo the only option at the moment is to do export/import. I am currently finalizing a butler transfer-run command that will do this in one go.
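For completeness, a rough sketch of the export side of that round trip, with a hypothetical local repo path, staging directory, and run collection name; the matching import_ call with transfer="auto" is shown above:

    from lsst.daf.butler import Butler

    # Hypothetical source repo and run collection, for illustration only.
    src = Butler("/path/to/gen3/repo")

    # Write the files plus an export.yaml describing them into a staging
    # directory that can later be handed to Butler.import_() at the destination.
    with src.export(directory="/tmp/export", filename="export.yaml", transfer="copy") as contents:
        refs = src.registry.queryDatasets("calexp", collections="u/someone/processed-run")
        contents.saveDatasets(refs)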

2 Likes

Hi again,
I’ve updated the initial tutorial: I have created a better example of the butler convert command-line task using more arguments, and a better example of exporting data (more than one collection).

Generation_3_Butler_notes_(S3).html (1.3 MB)

I have a couple of questions.

  1. Is there an “export all” functionality, or do I just have to list the top-level directory structure in each collection like I did above?
  2. I have recently been doing some upload-speed tests using LSST test data, and I managed to get an average upload speed of 42 MBytes/s using rclone with 50 parallel transfers to upload the data to S3. When I used the butler import_ function in Python, I got an overall upload speed of 11 MBytes/s. Is the decrease in transfer speed due to the fact that the import_ function also has to populate an SQL file (the registry)? And does the import_ function support parallel uploads?
2 Likes