Reading calexp catalogs using afw

jointcal obviously has to read calexp “src” catalogs. In the standard version of the simultaneous astrometry solver,
getting the catalogs into memory takes about 2 CPU seconds per catalog (containing typically 4,000 objects).
So far, in all use cases, getting the catalogs into memory dominates the overall time budget (both wall-clock and CPU time).
We select objects in a Python function, doing simple things such as checking flags, applying an S/N cut, and rejecting NaNs, explicitly iterating over catalog objects. This takes about 1 CPU s per catalog. Finding the file, reading it, and instantiating the afw objects takes about 1 s per catalog as well. Turning the afw data into what the astrometry code expects takes a negligible amount of time. I recently wrote a task that matches a whole exposure to an external catalog and tested it on CFHT/Megacam data (36 chips). It takes about 70 CPU s per exposure (with the Python selection function),
and ~32 CPU s when bypassing the Python selection function. Combinatorial matching and fitting the 36
WCSs take about 2 s. The same code reading ASCII catalogs completes the job in ~5 CPU s, wasting half of that converting strings to doubles.

I have two questions:

  • Is there an efficient way to select objects from an afw table in Python?

  • Is there a way to speed up the loading of a calexp src catalog? Spying on the process, I found about 700,000 calls to malloc per catalog, and about as many calls to cfitsio internals (e.g. ffgcv). The strace log shows that the same blocks of the data file are read several times. I wonder if the afw::table code actually reads the Footprint extensions of the SRC files, which we happen not to need. We just need to load the catalog itself, i.e. a ~2MB FITS table.

Thanks in advance for any help.

Something like this should be reasonably efficient, since all of the loops will be C++:

import numpy

mask = numpy.logical_and.reduce([catalog.get("flag_1"),
                                 catalog.get("flag_2"),
                                 catalog.get("flux") > 50.0])
subset = catalog[mask]

Note that any column-view access requires the initial catalog to be contiguous in memory (it will be if you’ve just read it from disk), and boolean-mask indexing like this virtually guarantees that the new subset catalog will not be contiguous. If you need a contiguous one, you can make it with a deep copy:

contiguous_subset = subset.copy(deep=True)
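
Applied to the kind of selection described in the question (flag checks, an S/N cut, and rejecting NaNs), the same column-based pattern might look like the sketch below. The column names (base_PsfFlux_flux, base_PsfFlux_fluxSigma, base_PixelFlags_flag_saturated) and the S/N threshold are only examples; substitute whatever your schema and cuts actually are:

import numpy

# Column views: each get() returns a numpy array spanning the whole catalog.
flux = catalog.get("base_PsfFlux_flux")           # assumed column name
flux_err = catalog.get("base_PsfFlux_fluxSigma")  # assumed column name
good = numpy.logical_and.reduce([
    numpy.logical_not(catalog.get("base_PixelFlags_flag_saturated")),  # flag check
    numpy.isfinite(flux),                                              # reject NaNs
    flux > 10.0 * flux_err,                                            # S/N cut
])
selected = catalog[good].copy(deep=True)  # deep copy so the result is contiguous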

If you do have to write a for loop over records in Python for something, the most important performance tip is to do key lookup in advance, e.g.:

# Look up the Key once, outside the loop.
flux_1_key = catalog.schema.find("flux_1").key
for record in catalog:
    flux = record.get(flux_1_key)  # Key-based access avoids a string lookup per record.

Using a Key to get a record value is much faster than looking up a Key from its string name, so you really only want to do the latter once, outside the loop.

It does indeed read the Footprints, and you can disable that by passing flags=afw.table.SOURCE_IO_NO_FOOTPRINTS as an extra kwarg to Butler.get or Butler.put when reading/writing SourceCatalogs.
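
For reference, a minimal sketch of what that call looks like, assuming the daf_persistence Butler (the repository path and dataId keys are placeholders for whatever your camera uses):

import lsst.afw.table as afwTable
import lsst.daf.persistence as dafPersist

butler = dafPersist.Butler("/path/to/repo")               # placeholder repository
catalog = butler.get("src",
                     dataId={"visit": 123456, "ccd": 12},  # placeholder dataId
                     flags=afwTable.SOURCE_IO_NO_FOOTPRINTS)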


Naoki Yasuda recently revised meas_mosaic to do the source reading with a process pool. If you’re CPU-bound then you might find a similar solution helpful.
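
In case it helps, here is a minimal sketch of the process-pool idea (this is not the actual meas_mosaic code; the repository path, dataIds, and column names are placeholders). Each worker reads and selects one catalog and returns plain numpy arrays, which pickle cheaply back to the parent process:

import multiprocessing

import numpy
import lsst.afw.table as afwTable
import lsst.daf.persistence as dafPersist

REPO = "/path/to/repo"  # placeholder repository

def read_and_select(dataId):
    # Each worker opens its own Butler so nothing unpicklable crosses process boundaries.
    butler = dafPersist.Butler(REPO)
    cat = butler.get("src", dataId=dataId, flags=afwTable.SOURCE_IO_NO_FOOTPRINTS)
    flux = cat.get("base_PsfFlux_flux")            # assumed column names
    flux_err = cat.get("base_PsfFlux_fluxSigma")
    good = numpy.isfinite(flux) & (flux > 10.0 * flux_err)
    # Return plain numpy arrays rather than the catalog itself.
    return cat.get("coord_ra")[good], cat.get("coord_dec")[good], flux[good]

if __name__ == "__main__":
    dataIds = [{"visit": 123456, "ccd": ccd} for ccd in range(36)]  # placeholder dataIds
    pool = multiprocessing.Pool(processes=8)
    results = pool.map(read_and_select, dataIds)
    pool.close()
    pool.join()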

I implemented both pieces of advice, and it now loads 6 catalogs/s. The whole process took 70 s and is now down to 10 s.
Thanks, Jim!

Jim beat me to it, so I didn’t get the credit!

Is 10s reasonable given the disk I/O rates and data volumes? Or is there still work to be done?