Obtain only image metadata with butler.get

Hi, I am wondering if there is a way to obtain just the metadata (or FITS headers) from a calexp dataset type using the butler. Right now, I am scraping through just the metadata of many exposures, something like:

refs = registry.queryDatasets("calexp", collections=collection)
for ref in refs:
    calexp = butler.get(ref, collections=ref.run)
    metadata = calexp.getMetadata()
    del calexp # so I don't run out of memory

This is going quite slowly as the butler is loading each exposure into memory. I see there is a parameters option in the butler.get docstring:

    datasetRefOrType: 'Union[DatasetRef, DatasetType, str]',
    dataId: 'Optional[DataId]' = None,
    parameters: 'Optional[Dict[str, Any]]' = None,
    collections: 'Any' = None,
    **kwds: 'Any',
) -> 'Any'
Retrieve a stored dataset.

datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
    When `DatasetRef` the `dataId` should be `None`.
    Otherwise the `DatasetType` or name thereof.
dataId : `dict` or `DataCoordinate`
    A `dict` of `Dimension` link name, value pairs that label the
    `DatasetRef` within a Collection. When `None`, a `DatasetRef`
    should be provided as the first argument.
parameters : `dict`
    Additional StorageClass-defined options to control reading,
    typically used to efficiently read only a subset of the dataset.

Can I use the parameters argument to select just the metadata? I’ve tried a test:

butler.get(ref, collections=ref.run, parameters={"test": "123"})

and get an error:

KeyError: "Parameter 'test' not understood by StorageClass ExposureF"

If I instead try a parameter that I found by examining ref.datasetType.storageClass: StorageClassExposureF('ExposureF', pytype='lsst.afw.image.ExposureF', delegate='lsst.obs.base.exposureAssembler.ExposureAssembler', parameters=frozenset({'origin', 'bbox'}),

butler.get(ref, collections=ref.run, parameters={"origin": '123'})

I get an error

/home/admin/lsst/lsst_22_0_0/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/obs_base/22.0.0+6225f1ba97/python/lsst/obs/base/formatters/fitsExposure.py in readFull(self, parameters)
    243         self._reader = self._readerClass(fileDescriptor.location.path)
--> 244         return self._reader.read(**parameters)

TypeError: read(): incompatible function arguments. The following argument types are supported:
    1. (self: lsst.afw.image.ExposureFitsReader, bbox: lsst.geom.Box2I = Box2I(minimum=Point2I(0, 0), dimensions=Extent2I(0, 0)), origin: lsst.afw.image.image.ImageOrigin = <ImageOrigin.PARENT: 0>, conformMasks: bool = False, allowUnsafe: bool = False, dtype: object = None) -> object

Invoked with: <lsst.afw.image.ExposureFitsReader object at 0x7febb2bfc8b0>; kwargs: origin='123'

which tells me that parameters is passed as keyword arguments to ExposureFitsReader.read(). If that’s true, is there a way to have the reader load just the metadata?

This is all the snooping I’ve done so far.


What metadata are you looking for in particular? Information about the exposure can be retrieved with butler.get("calexp.visitInfo"), which returns a VisitInfo object with a number of properties (e.g. date, exposureTime, observatory, weather, boresightRaDec). You can similarly get the wcs, photoCalib, bbox, and filterLabel objects, which together describe most of the information relevant to the exposure.

I don’t know that we have an explicit listing in the docs of all of the exposure components that can be gotten this way: we definitely should, if we don’t.

The parameters can be used to obtain a subset of the image. You can define a bounding box and get a cutout.
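As a minimal sketch of such a cutout read: the helper name get_cutout is made up for illustration, and the butler and ref objects are assumed to be the ones from the question. The "bbox" key and the Box2I type come from the StorageClass listing and the traceback above; the corner and size values in the usage comment are arbitrary.

```python
def get_cutout(butler, ref, bbox):
    """Read only the pixels inside bbox (an lsst.geom.Box2I) from an exposure.

    get_cutout is a hypothetical helper, not part of the butler API.
    """
    # "bbox" is one of the parameters listed for StorageClass ExposureF.
    return butler.get(ref, parameters={"bbox": bbox}, collections=ref.run)


# Usage (requires the LSST stack; corner and size values are made up):
# import lsst.geom
# bbox = lsst.geom.Box2I(lsst.geom.Point2I(100, 100), lsst.geom.Extent2I(200, 200))
# cutout = get_cutout(butler, ref, bbox)
```

Note that the parameter value has to be a real Box2I; a string such as '123' produces the TypeError shown in the question.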

For example, to get all the components associated with a calexp in the ci_hsc_gen3 output repository:

$ butler query-dataset-types --components $CI_HSC_GEN3/DATA calexp.*

Here calexp.metadata is the FITS header.
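Applied to the loop in the question, a rough sketch could look like the following. The function name read_component is hypothetical; it assumes the same butler and refs as in the question and simply appends the component name to the dataset type name.

```python
def read_component(butler, refs, component="metadata"):
    """Yield (dataId, component) pairs without reading the full exposures.

    read_component is a hypothetical helper, not part of the butler API.
    """
    for ref in refs:
        # e.g. "calexp" + ".metadata" -> "calexp.metadata"
        name = f"{ref.datasetType.name}.{component}"
        yield ref.dataId, butler.get(name, dataId=ref.dataId, collections=ref.run)


# Usage with the objects from the question (not run here):
# refs = registry.queryDatasets("calexp", collections=collection)
# for data_id, header in read_component(butler, refs):
#     ...
# for data_id, visit_info in read_component(butler, refs, component="visitInfo"):
#     ...
```

Because only the component is returned, there is no full exposure to delete after each iteration.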

Thanks Tim and John, calexp.metadata (along with calexp.visitInfo and calexp.wcs) works great for what I need. I wasn’t aware this functionality was available.

Timing a butler.get with a calexp data ref gives about 1.5 to 2 seconds, and a butler.get with calexp.visitInfo (or calexp.wcs) takes around 700 ms. I thought it would be much faster than that, but it should still give me at least a 2x speedup. Thanks!

Where are you doing this test? On the IDF it has to download the whole file to read a little bit of it, because cfitsio can’t access a file directly on a Google bucket. There is local file caching on the IDF, so if you butler.get the visitInfo and then the wcs it won’t download the file twice.

I was doing this on an internal cloud-deployed JupyterHub, similar to the IDF. Files are stored in a bucket. I tested this on (spinning) disks and the timing is 900 ms for calexp and 33.8 ms for calexp.visitInfo, so it looks like the latency comes from the object download, as you suggest.

If you turn on the timer.lsst.daf.butler logger at DEBUG level it will report the relevant times for downloading versus file reading.
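From a notebook, one way to do that is with the standard Python logging module. This is a sketch that assumes no other logging configuration is already in place; the logger name is the one given above.

```python
import logging

# Attach a handler to the root logger so that records are actually displayed.
# (basicConfig does nothing if the root logger already has handlers.)
logging.basicConfig(level=logging.INFO)

# Enable DEBUG only for the butler timing logger named in the thread.
timer_log = logging.getLogger("timer.lsst.daf.butler")
timer_log.setLevel(logging.DEBUG)
```

Setting the level on this one logger keeps the rest of the stack at INFO, so the timing messages are not buried in other debug output.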