How does the butler support compression?

I just noticed that if we compress files on disk (using gzip in this case), the butler can’t read them, although cfitsio can. That is,

calexp = butler.get("calexp", dataId)

fails but

calexp = afwImage.ExposureF(butler.get("calexp_filename", dataId))

works.

SDSS’s butler-equivalent checked for a variety of suffixes and handled this transparently. Are there plans for the butler to do something similar?
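For concreteness, the SDSS-era behavior amounted to something like this (a minimal sketch; the helper name and suffix list are illustrative, not the actual SDSS code):

import os

def findFile(base, suffixes=("", ".gz", ".fz", ".bz2")):
    # Return the first existing variant of `base`, or None if none exist.
    for suffix in suffixes:
        candidate = base + suffix
        if os.path.exists(candidate):
            return candidate
    return None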

See https://jira.lsstcorp.org/browse/DM-4924

@RHL

  1. Would you suggest that this should apply only to images, or to any file?
  2. Should the Butler internally keep track of different compressed formats to try,
    or should this be defined as an optional list in the camera mapper?
exposures: {
    raw: {
        template: "raw/%(runId)s/%(object)s/%(date)s/%(filter)s/%(visit)d%(state)1s.fits[%(extension)d]"
        template_suffix: ["", ".gz", ".fz"]
    }
}

If the Butler supports it for any file, perhaps a top-level suffixes list could be defined:

suffixes: ["", ".gz", ".fz"]

(I actually don’t know if you can specify lists in the policy files, but let’s leave that as an implementation detail for now.)

I think I’d support it for any type, certainly not just raw. I like the idea of specifying the list of suffixes in the mapper, but it should be applicable to any type (or a set of types). You might be able to bind it to the extension (although that’s not really the Unix way, but standards are slipping). So the top-level suffixes list would be an option, or maybe a dict:

suffixes = {
    ".fits": ["", ".gz", ".fz"],
}

but now we’re getting into implementation.

As a clever/awful/transparent hack, does the following work:

template: "raw/%(runId)s/%(object)s/%(date)s/%(filter)s/%(visit)d%(state)1s.fits{,.gz,.fz}[%(extension)d]"

This is not the general solution, I’m just curious if it works.

The simple answer is no, it does not work, at least not trivially.

For the longer term I think it makes sense to support this in the pluggable de/serializer (linked story).
I think this would be a short-term solution: whatever python type is declared in your policy must provide a method:
readFits(locationString, hdu, flags)
I think you could add support for reading compressed files to that type?
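For example, the contract might look something like this (a sketch; MyExposure is a hypothetical stand-in for whatever type the policy names, not an existing class):

class MyExposure:
    @staticmethod
    def readFits(locationString, hdu, flags):
        # The butler calls this with the resolved path; support for
        # compressed variants (.gz, .fz) would be implemented here or
        # in the underlying reader.
        raise NotImplementedError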

As I said in the initial posting, the FITS reader already does this. The problem is that the butler doesn’t pass it the base filename (as it thinks the file doesn’t exist).

This is the fundamental problem. Dataset existence is determined by matching the template location, and compressing a file using an external tool and changing its name means that the template location no longer matches. We need to have template locations that can match more than one pathname. Note that if that’s built into the template itself (e.g. using shell wildcard or matching syntax) it makes it difficult to use for output. Building in a suffixes list that would apply to all template locations for all dataset types seems like it might be problematic. So having an overridable dataset-type-specific method to determine existence of the dataset (with a default of doing the current matching) might be better.
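Roughly what I have in mind, as a sketch (the method and attribute names here are hypothetical, not existing butler API):

import os

class Mapping:
    # Per-dataset-type suffixes; the default reproduces the current
    # behavior of requiring an exact template match.
    suffixes = [""]

    def datasetExists(self, path):
        # Overridable hook: a mapping for compressed data could set
        # suffixes = ["", ".gz", ".fz"] to match more than one pathname.
        return any(os.path.exists(path + s) for s in self.suffixes)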

I wonder if the extension could be part of the dataId? If you completed the rest of the dataId then it’s possible the extension portion of the dataId could be discovered. I think it would have to be in the registry somehow though (if you’re using an sqlite3 registry).
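Something like this, perhaps (table and column names are hypothetical; the real registry schema may differ):

import sqlite3

dataId = {"visit": 12345, "filter": "r"}
conn = sqlite3.connect("registry.sqlite3")
row = conn.execute(
    "SELECT extension FROM raw WHERE visit = ? AND filter = ?",
    (dataId["visit"], dataId["filter"]),
).fetchone()
if row is not None:
    dataId["extension"] = row[0]  # e.g. ".fits.gz"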

The set of suffixes only works because cfitsio is clever enough to read foo.fits.gz given foo.fits, so KT’s suggestion of

So having an overridable dataset-type-specific method to determine existence of the dataset (with a default of doing the current matching) might be better.

would probably be fine. I agree that external compression isn’t an ideal solution, but I think it’s here for a while.
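That cfitsio behavior is easy to check: with only foo.fits.gz on disk, this should still work:

import lsst.afw.image as afwImage

# cfitsio falls back to foo.fits.gz when foo.fits itself is absent.
exposure = afwImage.ExposureF("foo.fits")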

I would think that, as regards data from specific instruments, we know what format it’s going to be (e.g. .fits.bz2 for SDSS spFrame, .fits.gz for other things), so that should just be encoded in the mapper.
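For instance, the mapper entry could just carry the compressed suffix in its template (a sketch with made-up template fields; I don’t know the actual filenames offhand):

exposures: {
    spFrame: {
        template: "spFrame/%(run)d/spFrame-%(camcol)d-%(frame)08d.fits.bz2"
    }
}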

Or, maybe I’m missing something here?

Is fpack compression supported (or will it be)?

@RHL will it work for you to have two dataset types that specify gzipped and non-gzipped files?
Per a conversation I just had with @ktl: if not, we can hack the posix reader to check whether the pathname exists and, if it does, read it; if not, look for the same pathname with .gz appended. It’s not pretty and may not stick around long term, but if/when it gets removed I suppose it would be replaced with something else (policy or something) to support it.

@nidever, do the readers for your python object (AFW image or some such) support fpack? What is the filename of a compressed file?

We could use fpack in the longer term (i.e. not to solve the particular problem that we have files on disk); there were bugs in cfitsio when I looked at this long ago, and I think the code has bitrotted.

I’m not sure that this replaces the desire to let people compress files, but if we got it up and running it would certainly lower the priority of the request.

You mean “raw” and “raw_gz”? I don’t think that’d solve the problem. I’m looking for transparent access to the data (that’s what the butler does) independent of whether the data’s compressed.

I mean that the butler’s _read method, given a location (which right now your policy returns as something like foo.fits), would do:

if not os.path.exists(pathname):
    pathname = pathname + '.gz'
if os.path.exists(pathname):
    finalItem = pythonType.readFits(pathname, hdu, flags)

Not quite, as cfitsio doesn’t want the “.gz” suffix.

if os.path.exists(pathname) or os.path.exists(pathname + ".gz"):
    # Pass the base name; cfitsio finds the .gz variant itself.
    finalItem = pythonType.readFits(pathname, hdu, flags)