FITS tile compression in the LSST stack
The recent increase in the number of mask planes caused a corresponding increase in the file sizes, as we went from 10 bytes per pixel to 12 bytes per pixel. To offset this, the recently-merged DM-11332 introduces support for FITS tile compression. Note, however, that compression has NOT yet been activated by default, as we're waiting for approval of RFC-378.
FITS tile compression is distinct from simply applying common *nix compression utilities (e.g., gzip, bzip2) to the FITS file. The latter causes the entire file to be compressed, meaning that reading the header or a subimage require uncompressing the entire file. In contrast FITS tile compression leaves the headers (relatively) untouched, and compresses the pixels only; moreover, the pixels are grouped into subregions ("tiles") before compression, allowing quick reading of subimages. FITS tile compression is supported by most modern FITS readers, including cfitsio, pyfits and ds9. If you desire to use a FITS reader that doesn't support FITS tile compression, you should be able to use the
funpack utility in cfitsio to uncompress a file and then read it. For more details on FITS tile compression, see 2009PASP..121..414P.
FITS tile compression may be either lossless or lossy, depending on the parameters chosen. Lossless compression can be performed on integer images with good compression factors (e.g., 4--10, depending on the distribution of values in the image), but floating-point images do not typically yield good compression factors (e.g., 1.1) when compressed losslessly. Lossy compression (typically applied to floating-point images) involves first quantizing the image, and then compressing the resultant integer image. Though the term "lossy" might be scary (are we flushing away all our hard work?!), Bill Pence (author of cfitsio) et al. have demonstrated (2010PASP..122.1065P) that, with appropriate parameter choices, lossy compression can achieve significant compression factors while preserving the scientific information. The Pan-STARRS Image Processing Pipeline routinely writes its reduced images with lossy compression, with no known ill effects. Clearly the use of lossy compression in the LSST stack needs to be investigated (a working group is being chartered; DM-11819) before being activated across the board, but once activated it would grant a significant gain in the storage sizing model. Regardless of the compression settings, pickling of images is still done without compression: we assume that keeping the computational burden down is more important than savings in network transfer (and we don't normally pickle large images anyway). Compression can be disabled globally if desired. In addition to the low-level interface for writing FITS images in afw, there's a high-level interface that allows butler datasets to be configured with appropriate compression.
RFC-378 seeks approval to make lossless FITS tile compression the default for writing all flavors of image (
Exposure). This would yield a minor space gain for floating-point images, but a big gain for masks. Those with opinions are encouraged to follow and/or contribute to the discussion.
Parameters are separated according to whether they are for compression or scaling. Scaling (a.k.a. quantization) is used to convert a floating-point image to an integer image, which allows higher compression factors at the expense of the loss of some information (typically unimportant information).
The compression options are used exclusively by cfitsio. They are:
algorithm: the compression scheme (what algorithm is used to do the actual compression). Options are:
NONE: no compression.
GZIP: standard GZIP compression.
GZIP_SHUFFLE: GZIP compression with shuffle (most-significant byte first); this is the default.
RICE: RICE compression.
PLIO: IRAF PLIO compression of masks; this is not generally appropriate, as it has a 24-bit limit, while our masks are 32 bit.
columns: number of columns in each tile (default is
0, which means the entire dimension).
rows: number of rows in each tile (default is
quantizeLevel: quantization applied by cfitsio to floating-point images. The FITS
BSCALE is chosen to be the standard deviation (note: cfitsio doesn't have access to our masks when it measures the standard deviation) divided by the
0 for lossless compression (this is the default).
The scaling options are used exclusively by our code in order to provide cfitsio an integer image which may be compressed. This provides more scaling options than cfitsio does, respects our own masks for statistics, and is extensible in case we want to add more scaling options later (e.g., logarithmic or asinh scaling). Options are:
algorithm: the algorithm used to determining the scales (and therefore the dynamic range). Options are:
NONE: no scaling at all; this is the default.
RANGE: scale based on the dynamic range in the image. This preserves the full dynamic range at the cost of sensitivity to low-level fluctuations.
STDEV_POSITIVE: the scale is set to sample the standard deviation of the image (
BSCALE = stdev/quantizeLevel) and if the entire dynamic range cannot fit then choose that it extends principally positive (allowing only
quantizePad standard deviations negative).
STDEV_NEGATIVE: similar to
STDEV_POSITIVE, but the fallback dynamic range extends principally negative.
STDEV_BOTH: the scale is set similar to
STDEV_POSITIVE, but the fallback dynamic range is shared between the positive and negative sides.
MANUAL: the scale is set manually.
bitpix: the bits per pixel, which may be one of
0,8,16,32,64,-32,-64 (negative means floating-point but you probably don't want to use those here;
0 means to use the
bitpix appropriate for the image, and this is the default); larger values give more dynamic range and produce larger images (which effect may be negated by compression).
fuzz: whether to add a random field of values distributed on
[0,1) before quantizing; this preserves the expectation value of the floating-point image, while increasing the variance by 1/12. The random values are removed on read (we record the seed in the header) by cfitsio.
seed: seed for the random number generator used when we
maskPlanes: a list of mask planes to be ignored when doing statistics; for the
quantizeLevel: for the
scheme=STDEV_* algorithms, specifies the ratio of the quantum size to the image standard deviation (default is 4).
quantizePad: for the
STDEV_NEGATIVE algorithms, specifies how many standard deviations to allow on the short side (default is 5).
bzero: for the
scheme=MANUAL algorithm, specifies the
BZERO to use.
High-level (butler) interface
Butler datasets may be assigned a
recipe name (which defaults to
default). This is a symbolic name which links to a set of
scaling options for
variance elements, according to the storage type (the only type that currently supports recipes is
FitsStorage, the appropriate type for FITS images). The default recipes are defined in
obs_base/policy/writeRecipes.yaml, and overrides may be specified in
obs_whatever/policy/writeRecipes.yaml. As an example, here's
# FITS images
# Default: no compression
# Lossless compression
# Basic lossy (quantizing) compression
To activate compression for a dataset, you just add a pointer to the desired recipe in your
WhateverMapper.yaml policy file, e.g., I could put in
Low-level (afw) interface
Compression is controlled using the
lsst.afw.fits.ImageWriteOptions class, which is composed of separate
ImageCompressionOptions which contains the compression options) and
ImageScalingOptions which contains the scaling options) elements. All image types now have
writeFits methods that can receive an
ImageWriteOptions instance (or, in the case of
Exposure, three of them -- one for each of the image, mask and variance planes), but all the old
writeFits overloads still work as they did before. When writing images through the high-level butler/persistence interface (using the formatters), the write options are specified in a
Compression can be disabled globally in Python:
or using a context manager: