Hi all, I have recently started at RAL as a Scientific Computing Graduate in the Tier-1 group where I’m going to be curating and serving astronomy survey images on the Echo (S3) Object Store. I thought it would be a good idea to create a brief tutorial on the data butler’s command line tasks and the lsst.daf.butler python module for anyone to use or for anyone to correct me if I’m using any module incorrectly. Please find the notebook attached below.
Thanks for this. I have a few questions (noting that I manage the butler development)
There are a few ways to configure a butler. In your example you don’t need to specify the datastore at all in your seed configuration because you are colocating the butler.yaml in the S3 bucket. butler create will do the right thing for you and create the butler.yaml to use a datastore relative to the location of the butler.yaml. I think this means you only need to be talking about the registry in that seed file.
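For example, this is roughly what a registry-only seed plus butler create would do (the repo URI and sqlite path below are placeholders, not anything from your setup):

```python
from lsst.daf.butler import Butler, Config

# Minimal seed configuration: only the registry is specified. The datastore
# section is omitted so the repo defaults to a datastore relative to the
# butler.yaml written into the repo root.
seed = Config({
    "registry": {
        "db": "sqlite:////home/username/my_registry.sqlite3"  # placeholder path
    }
})

# Roughly what `butler create` with a seed config does under the hood
# (bucket name is a placeholder).
Butler.makeRepo("s3://my-bucket/repo", config=seed)
```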
We have recently changed butler registry to support UUIDs rather than incrementing integers. This means that export and import will preserve the dataset ID and ingest-raws will always result in a repeatable UUID. It’s not enabled by default but is trivial to add to the seed file if that interests you. We are doing it to make it easier to merge registries during workflow execution.
I haven’t seen anyone configure a butler like this before so I had not spotted that if you specify a sqlite file explicitly in a path that is outside the butler root then we will not create the sqlite database. I’m not entirely sure if we should be creating it. If you are using an entirely local butler then we do create it automatically. Most people using S3 are using postgres.
If you are ingesting files again and again in testing you can create an index file using the astrometadata command and ingest-raws will use that rather than reading the file (this makes it feasible to ingest files from another S3 bucket).
Why are you copying the gen2 _mapper file into the gen3 repo? It’s not used by gen3 and gen2 can’t use S3.
In your notebook (line 79), the dataset types returned there are the curated calibration dataset types. You use butler write-curated-calibrations to add those, but butler convert will also do it for you.
getURIs is the “proper” interface for retrieving URIs to a single dataset because butler supports composite disassembly. This means that you can configure your datastore such that on butler.put() it splits the dataset into its component parts, so for an Exposure it would write the image, variance, mask, wcs etc. into separate files. The motivation for this is that you can then do butler.get("calexp.wcs", ...), and for S3 that will be much, much more efficient when disassembled since it only downloads the small WCS file rather than downloading the entire file just to read a small part of it. In general composite disassembly is not the default, but you can make it so by putting the relevant line in the datastore section of your seed yaml. getURI is there for the simple case and will break for you as soon as disassembly is turned on; raws are never disassembled so that’s always safe. If you have disassembled, the dict returned by getURIs will be filled in with keys like wcs mapping to a URI. getURIs returns the same answer as getURI in its first return value.
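As a sketch (the repo, collection and dataId values below are just illustrative):

```python
from lsst.daf.butler import Butler

butler = Butler("s3://my-bucket/repo", collections="HSC/runs/test")  # placeholder names

data_id = {"instrument": "HSC", "visit": 903334, "detector": 16}  # illustrative values

# getURIs always works: primary is a single URI when the dataset is stored as
# one file, and components is a dict (e.g. {"wcs": ..., "image": ...}) when the
# datastore has disassembled the composite.
primary, components = butler.getURIs("calexp", dataId=data_id)

# getURI only works while the dataset is stored as a single file; it will
# break once disassembly is enabled for that dataset type.
uri = butler.getURI("calexp", dataId=data_id)
```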
It was only yesterday that someone asked for a butler.getURIs(refs,...) API. I need to think about what that would return.
Note that you get a ButlerURI back from that which abstracts all the file access (it’s how we can have one datastore implementation working with S3 and local files). It’s not a simple string.
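For example (again with placeholder repo/collection names and an illustrative dataId, using a raw since those are never disassembled):

```python
from lsst.daf.butler import Butler

butler = Butler("s3://my-bucket/repo", collections="HSC/raw/all")  # placeholder names
uri = butler.getURI("raw", instrument="HSC", exposure=903334, detector=16)  # illustrative dataId

# ButlerURI hides the scheme-specific details: these calls behave the same
# whether the URI points at S3 or at local disk.
print(uri.geturl())    # the full URL as a plain string
print(uri.basename())  # just the file name component
if uri.exists():
    data = uri.read()  # file contents as bytes
```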
If you just want all the raw files copied locally then export is more than you need because it tries to give you additional data files to allow an import later. You should take a look at butler retrieve-artifacts and associated Butler API. For raws you can do the ingest again and get the same answer without having the associated metadata from the yaml file so retrieving and ingesting again should be the same as exporting and importing.
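A rough sketch of the Python side, assuming placeholder repo/collection names (check the retrieveArtifacts docstring for the full argument list):

```python
from lsst.daf.butler import Butler

butler = Butler("s3://my-bucket/repo")  # placeholder repo

# Select the raws to copy (dataset type and collection name are assumptions).
refs = set(butler.registry.queryDatasets("raw", collections="HSC/raw/all"))

# Copy just the files, with no export metadata, to a local directory;
# transfer="auto" picks a sensible transfer mode for the destination.
butler.retrieveArtifacts(refs, "/tmp/raw_copies", transfer="auto")
```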
“auto” for transfer in S3 should always do a “copy”. It uses the default best suited for the URI scheme. We use “direct” for raws because we put raws in a special bucket which gives us the ability to destroy the user bucket and redo things from scratch without having to delete the raws.
Thank you for the feedback. This is much appreciated, especially because you have first-hand experience developing the code.
I’m guessing what you mean here is that I’ll only need to create a reg.yaml file where the contents will be:

```yaml
registry:
  db: sqlite:////home/vrs42921/kit_test_1.sqlite3
```

I tried this and it works, thank you.
We plan to use postgresql in the long run. I’m just using sqlite3 to learn how to use the butler commands.
I think I misread an error. I’ve run the command again without the _mapper file in the gen 3 directory and it works fine.
This was done when I ran the butler convert command. However, the butler convert command didn’t finish running due to memory constraints (4 GB RAM). The generation 2 repository I used was from the LSST pipeline tutorial. Is it normal for the butler convert command to use that much memory for a 5 GB data butler repository?
In our use case we are going to be importing processed data (the reruns directory) into S3. That’s why I used raws as an example of exporting and importing, so I’ll know how to do it in the future. Another reason relates to the comment above: since butler convert didn’t finish running, all I had was the raw data to test all the commands with. Is there a better method for transferring the rerun directory between two systems (e.g. HPC at location A to S3 at location B) than exporting and importing the data?
*If I don’t specify the transfer it gives me this error
Use transfer="auto" and it will mostly always do the right thing. The butler import command line does this. Unfortunately the Butler.import_() API was developed before auto existed and so defaults to None (which means “the files are already in place so no transfer needed”).
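So in Python the call would look something like this (the repo URI and export paths are placeholders):

```python
from lsst.daf.butler import Butler

butler = Butler("s3://my-bucket/repo", writeable=True)  # placeholder repo

# Pass transfer="auto" explicitly because Butler.import_() predates "auto"
# and defaults to None (i.e. the files are assumed to already be in place).
butler.import_(
    directory="/path/to/exported/files",        # placeholder export directory
    filename="/path/to/exported/export.yaml",   # placeholder export metadata file
    transfer="auto",
)
```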
We recently reduced memory overhead for raw ingest by grouping. Some of this will depend on which version of the pipelines code you are using. I don’t know that we’ve ever done a conversion on a machine with such a small amount of memory. Maybe @jbosch has some thoughts on memory usage since he did the big conversions at NCSA.
Are you talking about gen2 reruns at A to gen3 S3 in B? That has to be done through butler convert – we always convert locally into fresh repos because it’s a lot more efficient to do a bulk s3 copy after the conversion. Once you have a gen3 repo the only option at the moment is to do export/import. I am currently finalizing a butler transfer-run command that will do this in one go.
Hi again,
I’ve updated the initial tutorial: I have created a better example of the butler convert command line task using more arguments, and a better example of exporting data (more than one collection).
Is there an “export all” functionality, or do I just have to list the top level directory structure in each collection like I did above?
I have recently been doing some upload speed tests using LSST test data and I have managed to get an average upload speed of 42 MBytes/s using rclone with 50 parallel transfers to upload the data to S3. When I used the butler import_ function in Python, it had an overall upload speed of 11 MBytes/s. Is the decrease in transfer speed due to the fact that the import_ function also has to populate an SQL file (the registry)? And does the import_ function support parallel uploads?
There isn’t an “export all” that takes all the dimensions, datasets, dataset types, collections and other items and dumps them somewhere. With sqlite it’s easier to copy the datastore and sqlite file to a new location, so it’s been a relatively low priority. Now that we have to write a command to migrate sqlite registries from the old integer dataset IDs to the new UUID form, we will have to have something that looks very much like “export all”.
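In the meantime, something like this sketch gets you most of the way there by listing the collections yourself (repo and collection names are placeholders):

```python
from lsst.daf.butler import Butler

butler = Butler("s3://my-bucket/repo")  # placeholder repo

collections = ["HSC/raw/all", "HSC/runs/test"]  # placeholder collection names

# Write the export metadata and copy the files to a local export directory.
with butler.export(directory="/tmp/export", filename="export.yaml",
                   transfer="copy") as export:
    for collection in collections:
        # "..." means all dataset types in that collection.
        refs = set(butler.registry.queryDatasets(..., collections=collection))
        export.saveDatasets(refs)
```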
The low-level ButlerURI.transfer_from() API is currently not async-aware. Slow transfer speeds have been coming up recently so it’s definitely something I want to look at.
An alternative approach for raw ingest is to do something like:
Run astrometadata write-index -c metadata on the raw data files.
Transfer the raw files (and the index files) to S3 using standard s3 tooling and put them in a special bucket location.
Now run butler ingest-raws --transfer=direct REPO RAW-BUCKET – this will ingest the files using their full URIs, without transferring them explicitly into the butler datastore, and it will read the index files to extract the metadata so it won’t have to read all the files.
Doing it this way means you can set up multiple butler repos in S3 (and also delete them) without having to continually upload the files from local disk.
For an import/export you should be able to export from a repo to an S3 bucket and import from an S3 bucket. I think boto3 still downloads each file to the client as part of a copy so it’s not great performance (but maybe since I last looked at this boto3 has got cleverer).
Now for some comments on the web page:
Maybe note that for butler subcommands, once an instrument is registered you can refer to that instrument by its short name (e.g. for write-curated-calibrations – currently butler convert will run that command automatically).
Might be worth showing butler --help in your example so that people can see that many of the abilities that are demonstrated later in the notebook can be done on the command line.
In the “Importing data across” section you refer to ingest-raws command and not import.
For code examples you will sometimes see from lsst.daf.butler import Butler (especially since that is what the middleware programmers use for all our code and examples). Some butler examples from scientists seem to prefer using dafButler.Butler – I imagine some of that is history, to distinguish it from dafPersistence.Butler from Gen2, but for a pure Gen3 example I don’t think it’s necessary.
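i.e. either of these works; the first is the middleware style (paths are placeholders):

```python
# Middleware style:
from lsst.daf.butler import Butler
butler = Butler("/path/to/repo")  # placeholder path

# Gen2-flavoured style that still works for Gen3:
import lsst.daf.butler as dafButler
butler = dafButler.Butler("/path/to/repo")
```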
You may want to explain that a DatasetRef is a combination of dataset type and dataId and can refer to an explicit dataset in a specific run (if ref.id is defined).
Not sure if you want to explain the visit_system field that comes back. For HSC it doesn’t matter but for LSSTCam it can make a difference.
I think the collections argument for queryDatasets might be defaulted to the one that you used to create the Butler so you might not need it in your examples.
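i.e. something like this is probably enough (placeholder names; worth double-checking that the default really is picked up in your version):

```python
from lsst.daf.butler import Butler

# Collections given at construction time become the registry defaults
# (repo and collection names here are placeholders)...
butler = Butler("s3://my-bucket/repo", collections="HSC/runs/test")

# ...so this probably does not need an explicit collections argument.
refs = list(butler.registry.queryDatasets("calexp"))
```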
If you have an explicit ref returned by queryDatasets you can use butler.getDirect() to get the dataset. This bypasses the registry and gets the thing directly using the ref.id. butler.get() always queries the registry and checks that the supplied dataset ref is consistent with what is in registry for that collection. This is because butler.get() has to be able to work with separate dataset type and dataId and wants consistency. The other thing butler.get() does is allow an explicit dataId to contain things like a detector name or exposure OBSID rather than a detector number and exposure integer.
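For example (placeholder repo and collection names):

```python
from lsst.daf.butler import Butler

butler = Butler("s3://my-bucket/repo", collections="HSC/runs/test")  # placeholders

# Refs from queryDatasets are resolved (ref.id is set), so the dataset can be
# fetched without another registry lookup.
ref = next(iter(butler.registry.queryDatasets("calexp", collections="HSC/runs/test")))

calexp_direct = butler.getDirect(ref)  # bypasses the registry, uses ref.id
calexp_checked = butler.get(ref)       # checks the ref against the registry first
```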
In cell 13, if you want to ensure that the source catalog you are using matches the calexp, you want to use something like butler.get("src", dataId=calexp_ref.dataId) – the current example looks like you are getting index 16 from the result and hoping that they match (which we can see they don’t, because the visit in the dataId is different).
In the “more plotting” section you then use the ref.dataId to get the associated calexp, so I’m confused about the ordering. Why not use the calexp dataId to get the source and then reuse the calexp you already had?
There are a couple of recent commands that might interest you (they have Butler APIs to match):
butler retrieve-artifacts can download data files from a repository to a location of your choice (including S3 bucket). This lets you quickly just grab a few files (say because you want to send them to a collaborator).
butler transfer-datasets transfers datasets from one butler repo to another. Current limitation is that it does not create missing dimensions records or dataset types during the transfer but those are things we can think about adding later. It’s been developed to allow registry-free pipeline processing to occur.
Thanks @timj for the comments. I resolved most of the comments in my latest draft of the Generation 3 Butler notes, which is attached above in HTML (.html), Notebook (.ipynb) and PDF (.pdf) formats. I didn’t address the visit system as I don’t understand how it will differ for LSST.
In gen2, visit was effectively the fundamental concept of an observation and all the instruments treated visit as a synonym for exposure or observation.
In gen3 we separate the two concepts so “exposure” means one self-contained observation that could be processed on its own, and “visit” is effectively a scheme for grouping exposures.
In gen3 we currently have two “visit systems” in place and they are set up after the raw data have been ingested. The two are either “one visit == one exposure” or else create visits by using the “group name” – we put a groupId header in each file so we can tell if the observing script has decided that two exposures should be processed together. We haven’t really fleshed out this process because we’ve never really taken any multi-exposure visit data (LSSTCam is required to support two-exposure or one-exposure visits). All the current pipelines assume a visit is a single exposure, and there is a step in the pipeline that morphs the exposure into a visit. All the pipelines require that visits are defined, so after ingest, if you want to use our standard pipelines, you will need to run butler define-visits.