Introducing ingestDriver.py

price · January 18, 2018, 1:05am

ingestImages.py (in pipe_tasks) is used to ingest images into a data repository, but it operates serially over the provided images. If you’ve got a cluster filesystem that can go a bit faster, you can use the new ingestDriver.py (in pipe_drivers; introduced in DM-13244). This operates on the images in parallel, allowing the process to complete more quickly.

For example, here’s how I recently added new HSC images to our data repo at Princeton:

pprice@tigressdata:/tigress/HSC/HSC-SSP $ ingestDriver.py /tigress/HSC/HSC '2017-*/*.fits' --cores 20 @$HOME/LSST/obs_subaru/hscIngestImages.badargs --mode=link -c clobber=True allowError=True register.ignore=True

The operation is I/O-limited, so you can use more “cores” than you have physical cores. Don’t use too many, though, and don’t spread the job out over lots of cluster nodes: cluster filesystems cope poorly with the many short reads this operation makes. Used in moderation, this should give you enough parallelisation to get the job done faster when you’ve got a lot of files.
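
To illustrate why more workers than physical cores can help with I/O-limited work, here is a minimal sketch using only the Python standard library. It is not how ingestDriver.py is implemented. The names read_header and read_headers, the 2880-byte read size, and the default of two workers per core are all assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_header(path, nbytes=2880):
    """Do one short read from the start of a file (2880 bytes is one FITS block)."""
    with open(path, "rb") as f:
        return path, f.read(nbytes)


def read_headers(paths, workers=None):
    """Read the start of every file in `paths` using a pool of worker threads.

    Each worker spends most of its time waiting on the filesystem, so the
    pool can usefully be larger than the number of physical cores.
    """
    if workers is None:
        # Hypothetical default: modest oversubscription, not a tuned value.
        workers = 2 * (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields (path, data) pairs in the same order as `paths`.
        return dict(pool.map(read_header, paths))
```

Calling read_headers(paths, workers=20) would mirror the --cores 20 setting in spirit. The right number for a given filesystem is something to find by experiment, keeping in mind the warning about too many short reads.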

RHL · September 12, 2018, 2:03pm

Note that Paul quoted '2017-*/*.fits' so that the glob expansion was done in Python, not by the shell. I’ve seen SEGVs resulting from over-long argument lists on lsst-dev.
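
The quoting point can be sketched with the standard glob module. When the pattern is quoted, the program receives it as one short string and can expand it itself, so the file list never passes through the command line. This is an illustration only, not ingestDriver.py's actual code; expand_patterns and its base argument are hypothetical names.

```python
import glob
import os


def expand_patterns(patterns, base="."):
    """Expand shell-style glob patterns inside Python, relative to `base`.

    Returns a sorted list of matching paths. The patterns arrive as plain
    strings (for example '2017-*/*.fits'), however many files they match.
    """
    matches = []
    for pattern in patterns:
        matches.extend(glob.glob(os.path.join(base, pattern)))
    return sorted(matches)
```

With the quoted argument from the command above, expand_patterns(['2017-*/*.fits']) would do the expansion that an unquoted pattern would have left to the shell.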