Documenting internal designs and implementations (following FITS data model discussion from HipChat)

This post is just to summarize and give follow-up on a discussion this morning (2016-09-14) about the best way to document the FITS data model (see DM-4621).

My main take-aways from that discussion are, to roughly paraphrase:

  • @jbosch: The FITS data model is something that we need to document internally for our own development, but it isn’t a stable public API. (All FITS interaction should currently be done through Stack code.)
  • @timj noted that this is effectively an interface control document (ICD). However, @ktl and others pointed out that making this an ICD (and thus an LDM) would put too much process burden on both the TCT and developers.
  • There’s a desire to keep this document versioned with the code. @jbosch suggests including the document in the code repo (that is, afw). There was some discussion about whether this meant the document was part of the afw user documentation (i.e., content in the doc/ directory). @timj wants to ensure that this document is directly citeable, rather than being a page in the user guide.
  • Working consensus seemed to be that the document should be a DM Technote (DMTN) and that it should be tagged with code versions.
  • There was also a desire to separate the abstract data model from format-specific concerns (to enable both FITS and HDF5 serializations, for example).

(The HipChat participants can tell me if I didn’t capture the ideas correctly.)

I’m in the middle of drafting LDM-493: Data Management Documentation Architecture (no link yet), and its purpose will be to streamline discussions like this one of ‘how do I document this?’

The following opinions thus reflect what I’m writing up in LDM-493, though of course they’re not change-controlled yet. So this discussion is a great opportunity to test LDM-493.

  • The doc/ directories of stack packages contribute to the Science Pipelines User Guide. User Guides are a class of documentation that we write primarily for our end-users (astronomers using LSST software and data, though there can also be user guides for internal customers; the DM Developer Guide is an example). Because User Guides are written specifically for users, we don’t want extraneous design and architectural documentation in the guides. Given that the FITS data model is currently a private API, we shouldn’t be including this information in the User Guide. Once it becomes a public API, the FITS data model should be part of the User Guide.

  • We absolutely need to document our designs and architectures to enable efficient internal development. Design Documents (LDMs) are formal documentation of DM designs. As others also noted in the HipChat conversation, I’ve noticed that there’s hesitation to document too much in the LDM series because of the process overhead. Thus in LDM-493 I’m proposing a class of document called an Implementation Technical Note. These are just like Technotes in every way except:

    1. Implementation Technical Notes have metadata that traces to an LDM design document. I.e., an Implementation Technical Note is a document written and maintained at a level directly relevant to DM developers, yet exists within the scope agreed upon in an LDM document.
    2. Unlike regular Technical Notes, Implementation Technical Notes are expected to be maintained with the reality of the code base. For example, an Implementation Technical Note can start out as a design proposal. Once code is implemented and the original design changes, the Implementation Technical Note should be updated.
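To make point 1 concrete, the traceability could be a single extra field in the technote's metadata file. A minimal sketch follows; the field names are illustrative (the `traces_to` field in particular is purely hypothetical, since, as noted below, the metadata treatment for Implementation Technical Notes doesn't exist yet):

```yaml
# Hypothetical metadata for an Implementation Technical Note.
# Field names are illustrative; only the idea of tracing to an LDM matters.
series: "DMTN"
serial_number: "021"
title: "FITS data model for afw objects"
authors:
  - "A. Developer"
# The one addition that distinguishes an Implementation Technical Note:
# the LDM design document whose agreed scope this work falls under.
traces_to: "LDM-151"
```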

Implementation Technical Notes vs User Guides

Implementation Technical Notes are different from User Guides in that User Guides document how the public API should be used, whereas Implementation Technical Notes document the design and implementation details for developers. The two sides of documentation can link between each other. As our codebase matures and less development is actively occurring, information will naturally migrate into the user guide.

Implementation Technical Notes vs Regular Technical Notes

Regular Technical Notes exist to document design proposals, experiments, and investigations not directly connected to the implementation of the Data Management System itself.

Documentation Architecture Visualized

The following diagram from the LDM-493 draft shows how these classes of documents fit together from an information flow perspective. Requirements flow into design documents and implementation technical notes for developers, which then flow to consumers in user guides.

An Implementation Technical Note for the FITS data model

To answer @rowen’s original question, my opinion is that this FITS data model document should be an Implementation Technical Note describing the data model from a developer perspective to enable internal DM collaboration.

This technote can be made with the regular lsst-technote-bootstrap template. Since Implementation Technical Notes don’t formally exist, per se, the metadata treatment for them does not exist yet. This technote will have a DMTN serial number, just like other technotes.

There are two options for where to put such a technote:

  1. In its own repo, as we do for other technotes, or
  2. In a directory inside the code (afw) repo, which makes it easier to connect the documentation to code in afw.

We’ve never done #2 before, but something like this has been requested by DESC. The caveats are:

  • We’d need to add a .travis.yml to the root of afw to enable Travis-based documentation builds.
  • We’d need to add a technotes/ directory to the root of afw to host such implementation technotes.
  • I’d need to adapt ltd-mason to only build the technote when the technote itself has changed, not for every ticket branch.
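For the first two caveats, the Travis configuration could be as small as the sketch below. Everything here is an assumption about how this *could* work, not an existing setup: the `technotes/dmtn-xxx/` layout is hypothetical, and the changed-path check is a stand-in for the ltd-mason adaptation mentioned above.

```yaml
# Hypothetical .travis.yml at the root of afw, building only the technote.
language: python
python:
  - "3.5"
install:
  - pip install -r technotes/dmtn-xxx/requirements.txt
script:
  # Skip the build unless the technote itself changed on this push;
  # a real implementation would live in ltd-mason rather than here.
  - |
    if git diff --name-only "$TRAVIS_COMMIT_RANGE" | grep -q '^technotes/'; then
      (cd technotes/dmtn-xxx && make html)
    fi
```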

There was some discussion on HipChat about doing #1, but making the technote a Git submodule in afw. I’m not sure this would work, since it’s only the branches and tags in the technote repo itself that matter to LSST the Docs. If we do #1, I think we’d just tag the repo with each version of the code base, and use branches from afw tickets (i.e., an afw ticket that touches the FITS data model must update both the technote repo and afw itself).
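The tagging half of #1 could look like the sketch below. The repository and tag names are hypothetical, and a throwaway local repo stands in for the real technote repo so the commands can be run anywhere; the point is just mirroring an afw release tag onto the standalone technote repo.

```shell
# Sketch: tag a standalone technote repo so it tracks afw releases.
# Repo and tag names are hypothetical.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "FITS data model technote"
# When afw 12.1 is released, record the matching state of the technote:
git -C "$repo" -c user.name=demo -c user.email=demo@example.org \
    tag -a afw-12.1 -m "Data model as of afw 12.1"
git -C "$repo" tag -l
```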

I welcome feedback on whether #1 or #2 is preferable.

Of course, I also welcome feedback on the concept of Implementation Technical Notes as a class of DM documentation.

I’m not sure I really think of this documentation as “Implementation Tech Notes” - or at least not all of it - for a couple of reasons:

  • I don’t think we have just a single level of public interfaces, and while code at the very lowest levels is strictly implementation, there are a lot of semi-public and protected interfaces between that and the higher-level interfaces that are more obviously public - and that makes it very hard to define where the User Guide ought to end and implementation documentation would begin.

  • I think of technotes as mostly existing as discrete, stand-alone documents. But documenting design decisions is just like documenting public APIs - it’s ultimately all connected (even if it won’t be until our docs are in a much better state than they are now). While there may be some design topics that are stand-alone enough (or global-enough) to merit pages that aren’t specific to some part of the codebase, I think that’s similarly true for public-level User Guide topics.

That makes me wonder if we should handle implementation docs via a suite of special reST markup blocks, e.g.:

  • A way to add an implementation note to a user-guide or reference documentation page that would perhaps normally be hidden or otherwise de-emphasized.

  • A way to mark larger documents as discussing design decisions or implementation details (perhaps at multiple levels), which would appear in different sections of the table of contents.
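As a sketch of what such markup might look like in a package's documentation source (both directive names below are invented for illustration; nothing like this exists yet):

```rst
.. Hypothetical reST markup for implementation documentation.
   Neither directive exists; they illustrate the idea only.

How to read a PSF model
=======================

Use ``Exposure.getPsf()`` to retrieve the PSF attached to an exposure.

.. implementation-note::
   :collapsed:

   The PSF is persisted as a set of FITS binary tables generated by the
   serialization framework; see the serialization design page.

.. design-discussion:: Why PSFs serialize to tables
   :level: subsystem

   Longer design material, which would be listed in a separate section
   of the table of contents.
```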

There may not be that much space between this proposal and your Implementation Tech Notes once the tech notes are integrated into a cross-linked documentation index, but in either case I think it’s fairly critical that the implementation docs be versioned and released along with the code (which I don’t think is obviously the case for regular tech notes), and to me that suggests they should go in the actual code repos, at least most of the time.

All of that said, I don’t actually think all the data model stuff that sparked this discussion does belong in implementation docs. The stuff that is relatively stable and isn’t algorithm-dependent - i.e. the stuff that mostly goes in FITS headers - should be considered a public API, and it’d be quite fair for it to go in the main User Guide (and until that’s really up and running, a Tech Note may be best in the interim to avoid having to reformat it later).

The formats I don’t consider a public API - the extra HDUs that serialize more complex objects - I think should actually not be documented directly at all, because that format is largely machine-generated; a user-visible class like PsfExPsf has methods that transform its data to a set of tables (which can call serialization methods of other classes they own recursively), and the Exposure that we see is the result of the serialization framework calling those methods on all of its constituent objects, merging tables with the same schema, and writing the tables out to FITS.

The set of tables corresponding to each serializable class should be documented. But I’m not convinced we need more than code comments for those: they’re really the equivalent of a paragraph or two each, and they should be geared towards a developer working on our serialization code, not someone external trying to write a different implementation of it. And we should have an implementation design document for how the serialization framework works.

But we absolutely should not try to reconstruct in document form a full Exposure object as generated by executing the serialization framework on an Exposure; not only is that a waste of developer time, it’ll be fragile to changes in the derived classes of the Psf and Wcs objects we actually put in a particular Exposure.

It’s pretty much the definition of an Interface Control Document. The user guide should link to the ICD but it is a public interface and not just user documentation. In all my previous telescopes the contents of the data products have been defined in ICDs. It’s a critical document that is explicitly associated with the data archive (and at CADC the FITS headers tell you where to find the document describing the data: which should be a DOI for documented version).

It would be a disservice to our future data curation experts if the LSST final data archive were full of files containing opaque, undocumented blobs. One of the purported reasons to use FITS is that the data format can easily be understood by future archive researchers even when Python and C are no longer common languages.

Once we know the algorithms and the exact models we’ll be running, I think it might make sense to define a more publicly readable format for saving them, and use that instead of our current approach, which is conceptually more like a binary pickle file than what most of us think of as FITS.

But even then, I can easily imagine that some of our objects - such as CoaddPsf objects composed of the composed-Wcss and wavefront-space Psfs of each of the input images that went into a coadd - will be sufficiently complex that that’s simply not worth it, and hence we’re better off writing small utilities (possibly in additional languages) to allow users to interpret them, if our code isn’t modular enough to do that itself with little overhead by then. That’s what SDSS did for PSF models and its equivalent of HeavyFootprints (which were written to FITS as opaque blobs).

In any case, we’ll always need to have the generic serialization format as an option, so if someone implements a new Psf that plugs into the stack they won’t have to rewrite all of Exposure's persistence to be able to use it.

I wonder if anyone with a HEP background has thoughts on this - I think of HEP data products as being typically much more complex than the standard astro ones, but they must have similar requirements on documenting and distributing things.

This conversation is bringing up a lot of valid points that are making me re-think my idealized doc taxonomy for LDM-493. Here are some initial thoughts (note this isn’t specific to the issue of documenting the FITS data model, this is more generic thinking about the role of internal docs vs public docs).

You’re right, I don’t want to have duplication between internal documentation (LDM/DMTN) and what’s published in the User Guide.

I think we might naturally see ideas written initially in LDMs or DMTNs (when the implementation doesn’t exist, or we don’t know how to present the information to users). Once those implementations mature, and the APIs become publicly usable, information will likely migrate from the DMTN to a page in the User Guide for the specific package.

There are a couple ways to make this migration happen:

  1. The DMTN is ‘hollowed out’ with original content being replaced by links into documentation in the User Guide. The DMTN could even be deprecated altogether. The trick here is making the ICD citeable, which is a valid point that @timj makes. I wonder if a DOI can point into a single page of a user guide?

  2. The DMTN and User Guide partially cover the same information (as appropriate, with the former taking an internal design perspective, and the latter taking the public user perspective). Overlapping information for both docs, like tables, would originate from the same file, and use Sphinx extensions to pull in and splice common source files. Sharing the same source information ensures both documents are current, while speaking to their own intended audience.
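Option 2 is already achievable with stock Sphinx: the `include` directive can splice a shared source fragment into both documents, so a table maintained in one file stays current in both. A minimal sketch, with hypothetical file names throughout:

```rst
.. In the DMTN source (file names hypothetical):

FITS header keywords
====================

.. include:: shared/fits-keyword-table.rst

.. And in the afw User Guide page, the same fragment is spliced in:

Reading exposure metadata
=========================

.. include:: shared/fits-keyword-table.rst
```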

Note that I’m open to having Implementation Tech Notes be co-located in the Git repositories of packages, which would:

  • Ensure that the docs are versioned in lock step with the code.
  • Make the code available for introspection to automatically generate documentation, when feasible.

Another possibility is that we put more developer-oriented documentation into a ‘Developer Guide’ section of the package API reference docs (i.e., content that resides in /doc), as you suggest:

This would obviate many DMTNs. The downsides are that we’d have to be careful to not confuse and clutter the end-user documentation. There’s still the issue that these would be less citeable, and these designs would disappear from DocHub browsing.

The cases where the proposed Implementation Technotes are really useful are:

  1. Designs and proposals for code that doesn’t exist yet (see all the design pages linked from for prime Implementation Technote candidates)
  2. To document systems that don’t exist as a single Git repo or project. SQR-006 is a good example of this. That document describes the purpose, philosophy and cohesive design of LSST the Docs. I still have user guides for individual components, like ltd-mason and ltd-keeper. The technote is written to explain why LSST the Docs exists, while the user guides describe how to use it.

That’s a good point. No use in me re-inventing patterns. @cwalter, do you have any pointers here? I’d be interested to see what LHC does here.

Another way of looking at Implementation Technote vs User Guide is this paragraph:

This paragraph, once expanded slightly, is essentially an Implementation Technote stating, to the audience of DM Developers, what our chosen design is for using FITS files. An Implementation Technote here can summarize the relevant public APIs and data model keywords and how we’ve chosen to make some of our serializations opaque and interfaced only through the specified APIs. Otherwise, where else have we written down our decision to make some of our FITS data model a black box?

On the other hand, the User Guide will be written specifically for users and won’t even need to mention the debate over why certain aspects of the data model are accessible only through Stack APIs. Instead, the User Guide would just decisively show how to access PSF models.

In other words, Implementation Tech notes are where we write down the strategies, debates, and decisions. User Guides are where we teach how to use the Stack as implemented.

Is this distinction clearer?

The distinction you’re making is clear, but I’m still having a hard time mapping it neatly onto our codebase. That paragraph actually provides an example of that too:

Many DM developers will want to know how to write a new class that uses the serialization framework - many more than will be interested in knowing how the serialization framework works, or understanding the design decisions behind that interface. To most DM developers, then, I think the documentation for the serialization framework’s plugin API is essentially a User Guide - but not one focused at public users (or only focused on a very small group of power-users who might as well be considered DM developers). That’s distinct from an implementation document describing why the serialization framework works the way it does, or how it actually maps the in-memory representation it gets back from specific classes to a format like FITS.

Yes, I agree with you. Documentation on how to write a new class that uses the serialization framework would certainly be in the User Guide. In this case, the API consumers are mostly DM developers. I suspect this is quite common.

I don’t think we will have DMTNs for every design decision, and we’ll want to prioritize effort on User Documentation, but once the docs diverge from discussing ‘what a user should know to use this and get work done’ and instead talk about ‘why this exists the way it does,’ that’s a good indication that an over-arching technote could be useful, in my mind.

In this discussion we should remain conscious of the spectrum of “users” who may need guiding and documentation and assistance. While many tools have “pure users” who use the tool as distributed without modifying it in any way, I think that our code, as more of a toolkit than a tool, will have a much smaller percentage of such users and instead will tend to have substantial numbers of “reassemblers/reconfigurers” who take the parts and put them together in different ways using their internal interfaces, “patchers/tweakers” who substitute individual components with slightly or substantially modified replacements, and “developers” who may be as sophisticated in algorithms and programming as project staff. So “internal design and implementation” documentation may not be as separate from “user documentation” as some of the above comments make it seem.

Thanks @ktl and @jbosch. It sounds like we should be pushing everything, from designs and architectures to practical user guides, into the Science Pipelines User Guide and do away with the idea of Implementation Technical Notes for specific systems and interfaces. I think I’m coming around to this idea. :slight_smile:

Given this idea, and for the sake of me writing this up in LDM-493, let me try to draft the workflow to go from a proposal for a new system to documentation for an implemented, functioning system:

  1. In the beginning, a system exists only as an idea, a set of user stories, or a design; the design for SuperTasks and Activators, or the design proposal for blended measurement, are examples. Initially the code for this won’t exist, and we aren’t sure where the code will exist. So the only place to write documentation is in a stand-alone Technote. Thus, all designs are initially documented as technotes.
  2. The code is being implemented. Now design content from the technote is adapted into an architectural description in the user guide, along with new content like tutorials, API references, and so on. These implemented parts of the system are removed from the design technote, and replaced with links to the new documentation in the User Guide.
  3. When the system is fully implemented, the design technote should be marked as deprecated, and replaced with links to documentation in the User Guide for the implemented system.

This workflow gives us a well-organized (thanks to DocHub) set of design documentation that doesn’t duplicate the User Guide, and doesn’t let the design documentation go stale once the system is implemented and being maintained.

Does this sound like a reasonable workflow?

How do we retain citability though? The tech notes are easy to cite (author, handle, title, version, and possibly DOI); a page in a user guide is not. “Section titled ‘FITS header’ in AFW docs” is not a way to reference something. Are you issuing DOIs for each section in each user guide?

That I don’t know yet. I’ll have to find out if DOIs can be issued for resources that sit inside larger resources that themselves have a DOI. I.e., if each section and/or page of the Guide can be issued a DOI with each release, which itself has a DOI.

That said, I’m not sure what this will mean for DocHub. I don’t imagine individual User Guide pages being discoverable by browsing DocHub in the same way that Technotes and User Guides themselves will be.

The overall feedback I’m getting is that I should prioritize the developer doc contribution experience by eliminating any possible ambiguity and overlap between ‘documenting a system’ and ‘documenting for the user of a system.’ I’m willing to accept this as a fine priority, even if it means extra work on the documentation engineering side to make this citeable/discoverable/traceable.

I don’t think we want to go so far as to say “all designs”. In a lot of ways, Confluence is a better tool to use when a design is in a draft-and-comment stage, and sometimes a commented, git-controlled source code file is a better way to prototype an API. If those are turned into implementations relatively quickly, I don’t think it’ll always be worthwhile to convert those into technotes in the interim. I would support converting those formats into technotes if the implementation isn’t scheduled immediately, and I’d be comfortable leaving it up to the technical managers as to when to do that (rather than having an explicit policy).

I very much agree with the rest of your proposed workflow.

I’m genuinely curious what the motivation is behind the focus on DOIs, having never made use of them myself. Is this just to encourage citations of our work? If so, I’d have thought we’d be pointing people at published papers for that. If it’s to provide a persistent link to a particular section of our documentation, why doesn’t a versioned web URL suffice?

Because a few years after LSST is shut down, the links won’t work and people won’t be able to find the document that contained the data model. As someone who has already been through a telescope shutdown, and who has seen that every single “” link that ever appeared in an ApJ or MNRAS article has stopped working, I am very sensitive to arguments involving referencing documentation via “transient” locators.

Do DOIs also provide some sort of persistent home for the documentation itself, then, that would remain even if the web site that originally hosted it is taken down? Or are you assuming that the documentation would end up getting hosted at a different address and we could then modify the DOIs to point to the new location?

A DOI is really a URL redirect with a promise that, in theory, it will always redirect to the document that was given the DOI. Issuing your own DOIs is possible but NSF-funded entities tend to find it hard to fulfill the longevity promise. We currently use Zenodo as our DOI issuer which requires we upload the files to Zenodo when we “mint” the content. Even without DOIs, citing a section of some big guide seems wrong to me. I don’t really understand why the guide can’t itself link to a standalone document.

I’d like to use DOIs to handle our query reproducibility requirement (rather than people putting the query strings in their papers, they put the DOI and that takes them to the query with a link to the results of the query).

Got it; thanks for the explanation. Certainly sounds like a worthy goal, though still I’m a bit worried about the tension between a desire for standalone documents for guaranteed persistence and a fully-integrated, heavily-linked documentation site for better navigation and discovery; hopefully @jsick can indeed find technical solutions that don’t force us to choose between them.

Sorry, just to add: in theory we could sign up with DataCite ourselves and issue our own DOIs, and then promise to handle the archiving of those endpoints and updating the DOI redirects when we lose funding. The problem is that (1) DataCite usually charges per DOI issued, and (2) when money is tight as a telescope is closing down, the last thing management cares about is ensuring that all the DOIs we’ve issued will be properly handled and handed off to an archival institution.

If we were going down the route of issuing our own DOIs it might make sense to collaborate with UofA to see if they would take on the longevity requirement.

So, a goal of LDM-493 is to have an explicit policy :slight_smile: but only in the sense of making the documentation workflow both more effective and less ambiguous than DM design and user documentation currently is.

The situation right now is that we’re seeing a lot of DM design information end up in Word documents, Google documents, Confluence pages, LaTeX documents and of course Sphinx technotes on (which is understandable because no one has said how design documentation should be delivered).

In effect, DM is closing tickets on design stories that aren’t delivering a durable design product to the Project. It’s only through @timj’s fantastic efforts in the dm-highlights that most of the DM team learns about these designs. And though I’m not part of the review process, I suspect that this heterogeneity in documentation is making reviews harder for the leadership team to prepare for.

SQuaRE’s motivation to build DocHub is to collect all LSST documentation under one system. With DocHub, documentation delivery will be unambiguous. If a document has been delivered, it’ll be registered and available through DocHub.

This does imply a common denominator of technologies. The document must be durably archived and versioned by something like Docushare or GitHub, have DocHub metadata, and be available as a static website through

I recognize that Google Docs and Confluence are effective for collaboratively drafting documents, and I don’t want to get in the way of that. My current LDM-493 proposal is that once these documents are ‘delivered’ (because a design ticket was closed, or equivalent), a snapshot (usually a PDF) of that Google Doc or Confluence page will be taken. That snapshot is treated as a technote. It’ll be stored and versioned on GitHub, have document metadata, be registered on DocHub and published on

These PDF-based technotes will look something like the pages from gh-publisher: metadata will be displayed alongside the PDF document. These landing pages will also support LaTeX documents and collections of Jupyter notebooks. SQuaRE will provide automations to make documentation delivery as unobtrusive to developer workflows as possible.

I think this will give us the best of both worlds: convenient drafting in the platform of your choice, along with standardized delivery.

(Sphinx will remain as a first-class technote format.)