Please document "standard" LSST DM FITS extensions

David Nidever asks:

I have a general question. Not sure which room to put this in (or if it should be a discourse topic). I just ran the demo the other day and was looking at some of the outputs in “output/sci-results/”. In particular, the “calexp” FITS files have many extensions and I was wondering if they are documented somewhere. The FITS files don’t seem to be self-documented. I could surmise that the first three extensions were image, variance, mask and then there was WCS information in another extension. But I think there are 11 extensions in total, and for most it was unclear to me what data was in them. Is the use of the extensions in calexp consistent in the stack (e.g. extension 10 is always WCS and so on) or can the stack tell what’s in a given extension by its contents?

My opinion is that we will need to document the LSST data model and how it maps to FITS (and HDF5), but I think it’s too early for us to be treating the current representation as an interface that has to be under change control.

I’ve no objection to a document describing things as they stand but I’m concerned about people then treating it as unchanging and complaining if their code breaks.

A (slightly edited) summary of the HipChat thread that followed is:

@jbosch

It’s not well documented anywhere, but the extra extensions beyond the first three are not really intended for use by FITS readers beyond our own code.

@RHL

The answer is, “Use the butler” – then you never need know.

@jbosch

They’re essentially an opaque blob that contains the PSF model, a potentially better representation of the WCS, and a list of all input images if the image is a coadd. They may contain even more things in the future.

@RHL

Naturally you really want to use external code. The first 3 HDUs are as for MaskedImages and that’s probably documented (I’ll look it up). The others are things like WCS and PSF and Jim will reply…

@ktl

The first 3 HDUs are https://lsst-web.ncsa.illinois.edu/doxygen/x_masterDoxyDoc/afw_sec_image_i_o.html
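As a minimal sketch of what a stand-alone reader can do today (assuming astropy; HDU names, order, and count are not guaranteed by the stack and may change with configuration), one can at least enumerate what each extension claims to be:

```python
from astropy.io import fits


def summarize_hdus(hdulist):
    """Return (index, EXTNAME, data shape) for every HDU in an open HDUList,
    so a reader can at least see what each extension claims to be."""
    return [
        (i, hdu.header.get("EXTNAME", "<unnamed>"),
         None if hdu.data is None else hdu.data.shape)
        for i, hdu in enumerate(hdulist)
    ]

# Usage on a real file (path is illustrative):
#   with fits.open("output/sci-results/.../calexp.fits") as hdus:
#       print(summarize_hdus(hdus))
```

This only surfaces the limited machine-readable self-description mentioned above; interpreting the opaque extensions (PSF model, extra WCS, coadd inputs) still requires the stack’s own readers.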

@jbosch

Also, what’s in those extra extensions can change based on how the stack is configured (which PSF modeling code you use, for instance), but they do contain enough information for the stack to know what’s in them; they’re self-describing in a very limited machine-readable sense, just not in any sort of human-readable sense.

@RHL

The corresponding SDSS files were the psField files. I provided a standalone C library that could be bound to e.g. IDL (it was a long time ago).

@ktl

If you need to understand this in any more detail, I think https://lsst-web.ncsa.illinois.edu/doxygen/x_masterDoxyDoc/classlsst_1_1afw_1_1image_1_1_exposure_info.html#aae90fe3f6a67c6a3ee9d68daf356c113 is the place to look.

@RHL

This should be on discourse. I don’t think that the current situation is acceptable in the medium term – not as bad as boost persistence but along the same lines. We need to be able to tell external users how to read our datasets without installing the full LSST stack. Some of this may be simplified (analogous to writing FITS WCS as well as a full layered-distortion model), some may be via standalone binaries (e.g. returning the PSF at a point). But I don’t think, “Use the butler” is going to be acceptable.

@jbosch

I actually think the on-disk formats are not too bad in terms of making it easy to write a stand-alone reader. The challenge for writing a stand-alone reader is, by far, just writing a stand-alone Psf or Wcs class to load into.

@jbosch

I have to admit I’m not terribly concerned about this problem. For at least Python and C++ users I think we should address it by trying to lower the burden of installing the individual components of the stack (which was never an option for SDSS), and I don’t really care that much about IDL users at this point (I’m sort of hoping they just go away as an important population by operations).

@mwv

An individual file that is useful independently of its ecosystem is always valuable. But, more to the point of @nidever’s question, it’s not about providing wrappers in IDL; it’s about finding the documentation to understand what is in each file extension.

@nidever

Yes, exactly. You can’t expect all users to only use the stack and the butler. If the documentation is there then they can figure out how to read/load the data.
There are IDL tools for easily reading images/binary tables from FITS into IDL arrays/structures, so reading the extensions is trivial; but then I need to know what I’m looking at and how to use it.

@nidever

In SDSS there’s been a tradition of having a “data model” that tells you the names, directory structure and file format of all the input/outputs of a pipeline. This has been extremely useful.

@RHL

It doesn’t work for psField files – the code is too complex and you can put anything in FITS


@timj

Let’s try it with no promises.
I think just including a URL string in the FITS header that pointed to the documentation would be a huge help.
ls.st it.
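A sketch of what that header pointer might look like with astropy; the DOCURL keyword and the URL itself are invented for illustration, not an existing convention (astropy splits long string values across CONTINUE cards automatically):

```python
from astropy.io import fits

# Hypothetical documentation URL -- no such page exists yet.
DATAMODEL_URL = "https://example.org/lsst/datamodel/calexp"


def add_doc_pointer(header, url=DATAMODEL_URL):
    """Record a pointer to the data-model documentation in a FITS header.

    'DOCURL' is an assumed keyword name, not an existing convention.
    """
    header["DOCURL"] = (url, "Data model documentation")
    return header
```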

Thanks @mwv!

Here’s the root URL of the APOGEE data model:
http://data.sdss3.org/datamodel/files/APOGEE_REDUX/
and here’s an example of the data model page for one specific output data product:
http://data.sdss3.org/datamodel/files/APOGEE_REDUX/APRED_VERS/TELESCOPE/PLATE_ID/MJD5/apVisit.html

I’m not saying this is the best it can be, but at least it attempts to document what’s in the file. It would be nice to have something like this for LSST.

Just to be clear, a data model is not the same as the serialization of the data model. You need to document both, preferably as distinct concepts.

I think the tension here comes from the competing goals of “document the data model” and “don’t break encapsulation and don’t repeat yourself”. While some of the stuff we put in the extra HDUs of Exposure objects is sufficiently naturally structured that it should be more obviously presented and documented (for instance, the coadd inputs tables), others - like the PSF - are just an on-disk representation of a polymorphic object, which could be one of any number of compatible plugins that conform to an interface.

They could even be plugins that are defined only in a level 3 software package and hence can’t be read without it. More importantly, these are complex objects, often composed of many other objects (which are sometimes shared between parent objects), and they have a lot of internal invariants that need to be satisfied.

It’s pretty much an axiom of good software design that complex objects shouldn’t expose their innards to the outside world; those are an implementation detail. And we certainly don’t want to implement those objects multiple times just to work around problems with installing the stack.

In fact, if I had to do it all over again (and I think someone has to, but probably not me), I’d have written the new FITS persistence framework for polymorphic objects that we’re using here as an extension to Boost.Serialization, rather than starting over from scratch. We started that effort in the hopes of producing a more transparent, self-documenting format, but it turned out the objects we were trying to persist really didn’t lend themselves to that (especially when it comes to persisting shared pointer links between related objects). As a result, I ended up reimplementing a lot of stuff Boost.Serialization already provided, and the new persistence framework actually makes it a lot harder to write a serializable plugin class than it used to be, with only a modest improvement in output format readability.

I thought we were soon going to start cleaning up the code to depend less on Boost; do we want to add another place where we need it?

I agree with the notion that we should switch from Boost features to standard library features where there are drop-in replacements, and I share @RHL’s feeling that many of the remaining libraries are not very good. But I think there are also many remaining high-quality libraries that we might find useful (some of which we already use), and I’m not at all convinced we should try to get out of using Boost entirely. Dropping a difficult-to-build dependency would be nice if it turns out we need almost nothing from Boost after we’ve switched to the C++11 standard library and cleaned up our own code, but that remains to be seen.

I think there’s a lot to be gained with a few hours of documenting the stuff that can be documented. One might hope that an early outcome of this discussion would be to do these easy things.

For the rest: “Serialization of PSF model 3.4 as written by libAwesomePSF SHA1” would be totally fine as a medium-term improvement.
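Such provenance could be carried in a few header keywords. A minimal sketch (the keyword names are invented for illustration, not a stack convention):

```python
from astropy.io import fits


def tag_psf_provenance(header, writer, version, sha1):
    """Record which code serialized the PSF extension, and in what format.

    Keyword names below are hypothetical, chosen to fit the 8-character
    FITS keyword limit.
    """
    header["PSFWRITR"] = (writer, "code that serialized the PSF model")
    header["PSFFMTV"] = (version, "PSF serialization format version")
    header["PSFSHA1"] = (sha1, "git SHA1 of the writer")
    return header
```

Even without a full stand-alone reader, this would let an external user see at a glance which code (and which revision of it) they would need to interpret the extension.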
