Thanks for looking at this. It’s really great. Some comments:
What does create mean for ECHO? Are you using a local SQLite file for the registry but an ECHO datastore?
What is being imported in the Import step?
For ingest-raws, are you ingesting from local disk to ECHO or ingesting from the raw files in the ECHO bucket? If ingesting from local disk then there will be the transfer overhead (and I haven’t parallelized the S3 copies). If ingesting from the bucket, each file has to be downloaded so that its header can be read to extract the registry information. This can be sped up considerably by writing index files for the raw files (astrometadata write-index) and putting those in the bucket as well. The ingest-raws command will then download the small JSON files rather than the huge FITS files.
define-visits is entirely registry based so I would not expect any difference between the two.
write-curated-calibrations reads lots of ECSV/YAML files into memory and then writes out transformed FITS files. This will write each file locally and then transfer it to the bucket so I assume that’s where the slow down is coming from. It does not attempt to batch up those transfers (datastore.put can’t be given multiple DatasetRef for now).
make-discrete-skymap is going to be slower because currently we do not cache the file locally when using S3. This means that the get of calexp.wcs downloads the full file, reads the WCS, then deletes the local file, and then calexp.bbox does the same thing. You could enable caching for calexp by changing the configuration file for datastore but I haven’t enabled it because I don’t have cache expiry. One thing I would be very interested in is how your timing changes with ECHO if you use composite disassembly – in that datastore mode I wrote components out as separate files such that calexp.wcs only downloads the WCS part. You can see how to do this by looking at pipelines_check.
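The double download described for calexp.wcs and calexp.bbox can be illustrated with a toy local cache. This is only a minimal sketch, not the butler datastore implementation: `CachingFetcher`, `fake_download` and the key name are hypothetical, and, like the situation described above, it has no cache expiry.

```python
import tempfile
from pathlib import Path


class CachingFetcher:
    """Toy stand-in for a datastore that keeps remote files on local disk."""

    def __init__(self, download, cache_dir):
        self._download = download          # hypothetical callable: key -> bytes
        self._cache_dir = Path(cache_dir)
        self.downloads = 0                 # number of real remote fetches

    def get_local_path(self, key):
        local = self._cache_dir / key.replace("/", "_")
        if not local.exists():
            # Only hit the remote store when the file is not cached yet.
            local.write_bytes(self._download(key))
            self.downloads += 1
        return local


def fake_download(key):
    # Stands in for an S3 GET of a complete calexp file.
    return b"header" + b"\0" * 1024


with tempfile.TemporaryDirectory() as tmp:
    fetcher = CachingFetcher(fake_download, tmp)
    fetcher.get_local_path("calexp/visit1.fits")   # e.g. needed for calexp.wcs
    fetcher.get_local_path("calexp/visit1.fits")   # e.g. needed for calexp.bbox
    n_downloads = fetcher.downloads

print(n_downloads)
```

Without the cache, both component reads would trigger a full download. With it, the second read is served from local disk, at the cost of needing some policy for removing old files.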
Yes, the registry is stored locally for ECHO. I used --seed-config “path to sql file” --override
The Import step imports the reference catalog from the converted gen2 LSST Science Pipelines tutorial.
In this case the raw files are stored locally and ingested into ECHO. I’ll try the other method you suggested and assess the timing difference.
I’ll look into using composite disassembly and try to get some analysis done by the end of next week.
I am recreating the LSST Science Pipelines tutorial (gen2) using only Generation 3 command line tasks and pipetasks, with release w_2021_30. I still haven’t managed to get the measure task working, so I also couldn’t run the mergeMeasurements and forcedPhotCoadd tasks (a different error from the one I got last time).
Note that v22 is not at all related to w_2021_30, and using packages from these two in combination is likely to fail. As it says at Release 22.0.0 (2021-07-09) — LSST Science Pipelines, the release is based on w_2021_14 plus a patch.
Hey @joshuakitenge, I work on all the gen3 processing user facing bits and would be happy to help you out on this, it might take me a few minutes to get up to speed though.
If I understand correctly, you are trying to follow Getting started with the LSST Science Pipelines — LSST Science Pipelines in the context of gen3? If that is the case, I can understand why you are having trouble: none of that is really applicable. To make things more confusing, many of the tasks contain code to make them run in both gen2 and gen3, so it’s possible to end up mixing and matching in a confusing way (for instance, command line tasks simply do not exist in the gen3 framework).
I’m going to try to pull together some information for you, but I wanted to drop a note to let you know I am working on it. While I do that, what are you interested in specifically, processing data? ingesting data? running analysis against processed data?
Hey @natelust, I have recently started at STFC, RAL as a Scientific Computing Graduate in the Tier-1 group, where I’m going to be curating and serving astronomy survey images on the Echo (S3) Object Store (ECHO is a Ceph storage system that can be accessed through an S3 interface).
I’m interested in all 3 of these.
Processing data
I’m interested in comparing the overall time taken to process the data and store it locally (OpenStack VM) versus on an ECHO (S3) object store, to assess whether the ECHO (S3) object store is a viable alternative to local storage. This is to help alleviate disk pressure (e.g. on an HPC system) for scientists who will be processing data with Gen3 in the future.
I’m also interested in studying and documenting as much information about Generation 3 as possible, to help LSST:UK, and hopefully the whole LSST community, understand the LSST Science Pipelines better.
Ingesting data
One of the aims of my work is to serve processed and raw LSST data on the ECHO (S3) object store.
I’m planning on testing ingestion of the raw files using astrometadata write-index, to optimize the workflow.
Running analysis against processed data
Currently I’m testing authentication/authorisation workflows that allow federated end users (astronomers) to access the data that would be stored in ECHO through a Jupyter notebook.
The Jupyter notebook workflow works using master S3 credentials for ECHO; however, this is not the workflow we want to give to our end users. Currently we are looking at workflows where we authenticate users with an IAM service, and through that service the end users are granted temporary S3 credentials (read access) so they can access the butler repo stored on ECHO to do their analysis.
We are committed to supporting object stores for processing and welcome any feedback you give us. I will have to sort out file caching at some point because we are planning on adding quantum clustering to the workflow graph so that a single job can run multiple quanta – this will make use of caching by noting that the next quantum will be able to read the output locally from the previous quantum without having to fetch it from the object store.
I am also wondering whether we should add asyncio support to ButlerURI and allow a bulk butler.put that can parallelize the file transfers on job completion. Allowing a bulk ingest that can use ButlerURI.transfer_from() with asyncio might be useful as well.
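As a rough illustration of the bulk-transfer idea, the sketch below runs several simulated uploads concurrently with asyncio and caps how many are in flight at once. It does not use ButlerURI or butler.put: `upload`, `bulk_put` and the file names are hypothetical, and the sleep stands in for an S3 transfer.

```python
import asyncio


async def upload(name, delay=0.05):
    # Simulated transfer; a real version would perform an S3 PUT here.
    await asyncio.sleep(delay)
    return name


async def bulk_put(names, max_concurrent=4):
    # Limit concurrency so a large batch does not open too many connections.
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(name):
        async with sem:
            return await upload(name)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(n) for n in names))


files = [f"dataset_{i}.fits" for i in range(8)]
uploaded = asyncio.run(bulk_put(files))
print(uploaded)
```

With eight files and four concurrent slots, the simulated batch takes roughly two transfer times rather than eight, which is the kind of saving a bulk put on job completion would be aiming for.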
There are two formats of index files so you get to choose. One stores the translated metadata and the other stores all the FITS headers. Translated metadata is the smallest and fastest to process but we probably will use the raw header option in the index because that form allows us to modify the header translation and reingest without rewriting the index file (so the index file can become a permanent read-only artifact).
We haven’t used these index files “in real life” and there has been talk of allowing an option that writes the index files to a parallel directory tree rather than storing them directly alongside the raw data. Doing that makes it easier to delete and rebuild them without having to touch the curated raw folders.
A lot of this also assumes that these index files are created incrementally as new data arrives (and so the headers can be read during data transfer) but of course creating these index files from pre-existing data in buckets still requires download of the entire file. I have been considering the ability to read the first N-bytes from the file to minimize the download. This should work fine for LSSTCam data but for DECam it won’t work because DECam files store multiple detectors in a single file.
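The "first N bytes" idea can be sketched using the fact that a FITS header is made of 80-character cards packed into 2880-byte blocks and terminated by an END card. The code below is only an illustration under that assumption: `read_range` is a hypothetical callable (for an object store it could be backed by a ranged GET), and an in-memory file is used here in place of a bucket.

```python
import io

BLOCK = 2880  # FITS files are organised in 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def read_primary_header(read_range, max_blocks=100):
    """Fetch header blocks one at a time until the END card is seen.

    Returns the header cards and the number of bytes that had to be read.
    """
    cards = []
    for i in range(max_blocks):
        block = read_range(i * BLOCK, BLOCK)
        if len(block) < BLOCK:
            raise ValueError("file ended before the END card")
        for j in range(0, BLOCK, CARD):
            card = block[j:j + CARD].decode("ascii")
            if card.rstrip() == "END":
                return cards, (i + 1) * BLOCK
            cards.append(card)
    raise ValueError("no END card found")


def make_card(text):
    return text.ljust(CARD).encode("ascii")


# Build a synthetic file: a one-block header followed by dummy pixel data.
header = b"".join(
    make_card(c)
    for c in ("SIMPLE  =                    T",
              "BITPIX  =                   16",
              "NAXIS   =                    0",
              "END")
).ljust(BLOCK, b" ")
fake_file = io.BytesIO(header + b"\0" * (10 * BLOCK))


def read_range(start, length):
    # Stand-in for a ranged read from an object store.
    fake_file.seek(start)
    return fake_file.read(length)


cards, bytes_read = read_primary_header(read_range)
print(len(cards), bytes_read)
```

This only retrieves the primary header. For files that hold several detectors, the per-detector headers sit further into the file, which is consistent with the DECam limitation mentioned above.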
Our plan for the IDF is to have a butler client/server that uses the science platform A&A system to access it and returns signed URLs that clients can use for reading from and writing to the object store. Science users will not have direct access to the registry SQL database.
Hi, I have redone the timing tests for local storage and ECHO storage for the LSST pipeline tutorial, using Generation 3 commands and version w_2021_30 of the stack. I have also tested two configurations of CephFS: one with the registry within CephFS and the other with the registry in local storage.
The composite disassembly run was 38.4 GB.
The test run without composite disassembly was 12.4 GB.
Was this increase in the data size expected?
(processes = 1 on the pipetasks)
Thanks for doing this. I am confused by some of the results.
For example write-curated-calibrations should be no different with or without composite disassembly because none of the curated calibrations are disassembled and yet somehow it’s 10% slower. I may well have to implement a butler.put that allows multiple datasets to be stored at once so that the datastore can parallelize uploads.
Coaddition is going to be slow because each component is downloaded separately from S3 and then combined into a single Exposure. It’s good to see that there is no slowdown with any of the runs with disassembly that used a “local” filesystem. I think I may have to implement an asyncio parallel file retrieval option (and storage option) so that we can be sending these files to S3 simultaneously. It is interesting how make-discrete-skymap is barely any faster despite it downloading significantly less data.
Lots of separate FITS files are going to be bigger than one file, although I wasn’t expecting a factor of 3.
Rewind, reset on V22. Nate, I have just completed my second run-through of the tutorial for V22. I am now planning to repeat the process for V22 with gen3. As I understand, I need to reinstall the V22 pipeline software, THEN, apply a patch to incorporate the latest gen3 features. Is this correct? Also, are the tutorials available to guide us through this new process? Please advise. Many thanks, Fred, Dallas, tx
No. You can use gen3 in v22. It’s 2 months out of date but it’s still possible to use it. If you are serious about gen3 though you should install a recent weekly. You don’t need to patch anything but instead of installing v22_0_1 you would install w_2021_31 (or whatever the newest is). The one caveat being you would need to use a different conda environment so when you do the newinstall step you need to use the newinstall.sh corresponding to the weekly you want to install (rather than using the v22 tag, use the w.2021.31 tag).
Tim, I ran the latest weekly version of newinstall.sh:
bash ./lsst-w.2021.31/scripts/newinstall.sh -ct
I installed the latest weekly lsst software:
eups distrib install -t w_2021_31 lsst_distrib
I have a reference to an html file that I believe is intended to guide me through the latest Gen3 tutorial:
Gen3run-V22.html
Can you confirm that the tutorial HTML doc above is what I should follow?
many thanks, Fred, Dallas, tx
That is an “unofficial” tutorial from @joshuakitenge but should be fine as a first step. There are also all the DP0.1 tutorials that I believe you’ve already located, in particular notebook 4.
The replacements for the gen2 tutorials at pipelines.lsst.io are still being worked on but vacations slow things down a bit. We are also working on documentation for how to construct pipelines.
Good deal Tim. I’m going to work with notebook 4, as I enjoy using jupyter notebooks, since I can run a few lines, look at some objects, then repeat.
Thanks, Fred, Dallas tx
Today I merged a fix for some of the slow down with composite datasets and S3. Turns out that we weren’t caching the boto3 client and so every single get was taking far longer than it should have done. Should be in the weekly coming out tonight.
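The general pattern of reusing one client rather than building a new one for every get can be sketched as follows. This is not the actual fix that was merged: `FakeS3Client`, `get_client` and the endpoint URL are hypothetical stand-ins, with the fake class taking the place of an expensive-to-construct client such as boto3's.

```python
from functools import lru_cache


class FakeS3Client:
    """Stand-in for a client whose construction is expensive."""

    instances = 0

    def __init__(self, endpoint):
        FakeS3Client.instances += 1
        self.endpoint = endpoint


@lru_cache(maxsize=None)
def get_client(endpoint):
    # Built once per endpoint; later calls return the cached object.
    return FakeS3Client(endpoint)


for _ in range(100):
    client = get_client("https://s3.example.org")  # hypothetical endpoint

n_clients = FakeS3Client.instances
print(n_clients)
```

One hundred simulated gets construct a single client, so the per-call setup cost is paid only once.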
I’m continuing with my gen3 execution of the pipeline processes.
I’m following steps in Gen3run-v22.html. I’ve successfully registered the HSC instrument:
butler register-instrument GEN3_run/ lsst.obs.subaru.HyperSuprimeCam
Now I’m attempting the butler import, which produces many errors:
butler import GEN3_run/ ~/lsst_stack/DATA_gen3/ --export-file exports.yaml
I believe I may need to “hack” a convert from the gen2 to gen3 repository. I have reincarnated the gen2 repository per the original tutorial.
I need an example of the butler convert command. I’ve fabricated this:
butler convert --gen2root /Users/fredklich/Downloads/lsst_stack/testdata_ci_hsc /Users/fredklich/Downloads/lsst_stack/GEN3_run
I’m not clear on these points:
Do we need to reincarnate JUST the Ref Catalogs from gen2, or
both the gen2 HSC repository data and the Ref Catalogs?
Can you provide an example of the above butler convert command?
Thanks, Fred, Dallas, TX