Loading large numbers of catalogs with the Butler

danielsf · August 8, 2019, 4:20pm

As a part of Commissioning, we need to write scripts to test things like photometric and astrometric repeatability. This means getting all of the observations of a region of sky and looking at the distributions of the different measurements of individual objects in that region. I have been doing this with code like

for data_id in list_of_data_id_in_region:
    src = butler.get('src', dataId=data_id)
    calexp = butler.get('calexp_photoCalib',
                        dataId=data_id)
    ...  # analysis code

I am finding that the butler.get steps take about 0.1 seconds per data_id. This becomes a problem when you are dealing with a few times 10**4 data_ids (as in the HSC UDEEP field), since loading alone then takes thousands of seconds. Is there a more efficient way to load data from a large number of visits with the Butler, or do we just need to eat the cost (presumably running a pre-burner to load all of the data we want into a more columnar form before doing any analysis)?
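
For concreteness, the "pre-burner" idea could look roughly like the sketch below: one pass that calls a loader once per data_id and stacks only the needed columns into flat arrays, so the analysis itself never goes back to the Butler. This is an illustration under assumptions, not Butler API: load_catalog is a hypothetical stand-in for whatever wraps butler.get('src', dataId=data_id), it is assumed to return a dict of equal-length NumPy arrays, and the column names and fake loader are placeholders.

```python
import numpy as np


def preload_columns(data_ids, load_catalog, columns):
    """Load each catalog once and stack the requested columns.

    load_catalog is a hypothetical callable (not part of the Butler) that
    takes a data_id and returns a dict mapping column name -> 1-D array.
    """
    chunks = {name: [] for name in columns}
    row_source = []
    for i, data_id in enumerate(data_ids):
        catalog = load_catalog(data_id)
        n_rows = len(catalog[columns[0]])
        for name in columns:
            chunks[name].append(np.asarray(catalog[name]))
        # Remember which data_id each row came from.
        row_source.append(np.full(n_rows, i, dtype=int))

    if not row_source:
        # np.concatenate rejects an empty list, so handle that case here.
        table = {name: np.array([]) for name in columns}
        table["data_id_index"] = np.array([], dtype=int)
        return table

    table = {name: np.concatenate(parts) for name, parts in chunks.items()}
    table["data_id_index"] = np.concatenate(row_source)
    return table


def fake_load_catalog(data_id):
    # Stand-in loader for demonstration only; returns two rows per data_id.
    visit = data_id["visit"]
    return {
        "ra": np.array([visit + 0.1, visit + 0.2]),
        "flux": np.array([10.0 * visit, 20.0 * visit]),
    }


data_ids = [{"visit": 1}, {"visit": 2}, {"visit": 3}]
table = preload_columns(data_ids, fake_load_catalog, ["ra", "flux"])
```

The stacked arrays could then be written out once (for example with np.savez) and reloaded for each analysis run, so the per-data_id cost is paid a single time.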

Thanks.

ktl · August 8, 2019, 4:41pm

Once we are routinely producing Parquet files compliant with the Science Data Model, they should be more efficient to retrieve than afwTable-in-FITS-binary if you only want a limited set of columns. I believe this is coming soon (before the end of the calendar year).

parejkoj · August 8, 2019, 8:54pm

As part of DM-9071, I’d experimented with reading catalogs and other butler data in parallel with Threads and Processes. Unfortunately, threads did not help, and I ran into inefficiencies with pickling for processes. I’d like to give another try at seeing whether catalog.readFits is threadsafe now that we’ve removed Citizen.
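
A thread-based experiment of this kind might be set up as in the sketch below. It is not the DM-9071 code: load_one is a hypothetical stand-in for a per-data_id read, and the fake loader only sleeps. Because time.sleep releases the GIL, threads help in this toy case; a real reader that holds the GIL or is not threadsafe would not show the same gain.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_serial(data_ids, load_one):
    # Baseline: one read after another.
    return [load_one(data_id) for data_id in data_ids]


def load_threaded(data_ids, load_one, max_workers=8):
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_one, data_ids))


def fake_load_one(data_id):
    # Hypothetical stand-in for a catalog read; sleeps to mimic I/O latency.
    time.sleep(0.01)
    return data_id["visit"] * 2


data_ids = [{"visit": v} for v in range(20)]
serial_result = load_serial(data_ids, fake_load_one)
threaded_result = load_threaded(data_ids, fake_load_one)
```

Timing the two calls with time.perf_counter on real reads is what would show whether threads help for a given reader.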

Have you looked at the code in validate_drp that already computes repeatability metrics and does the n-way matching?