Requesting feedback on the new task documentation framework

Today I’ve shipped a new framework for documenting our tasks in the pipelines.lsst.io documentation site. I’d also like to acknowledge the contributions along the way from @mssgill, @swinbank, and @KSK. I’d love for you to take a look and give your feedback in this topic thread.

This framework iterates on what we’ve done previously in Doxygen. This Sphinx-based reimagining of task documentation leverages custom Sphinx extensions (implemented in Documenteer) to automate as much of the documentation process as possible. For me, this is a fun moment because we’re finally tapping into the extensibility of Sphinx that drove our decision to adopt it.

There’s lots more that can be done to make useful task documentation. Now is a good time, I think, to show this work to the DM team and get your feedback.

Brief tour of task documentation in action

You can see the new task documentation today in the daily builds of the lsst.pipe.tasks documentation. The module homepage lists tasks with this brief bit of boilerplate:

Task reference
==============

Command-line tasks
------------------

.. lsst-cmdlinetasks::
   :root: lsst.pipe.tasks

Tasks
-----

.. lsst-tasks::
   :root: lsst.pipe.tasks
   :toctree: tasks

Configurations
--------------

.. lsst-configs::
   :root: lsst.pipe.tasks
   :toctree: configs

The task and config summaries come from the one-sentence summaries in the corresponding class docstrings.
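
As a rough illustration of where those summaries come from, the sketch below pulls the first paragraph out of a class docstring and flattens it to one line. It is not Documenteer’s actual implementation; `summary_sentence` and `ExampleTask` are hypothetical names used only for this example.

```python
import inspect


def summary_sentence(cls):
    """Return the first paragraph of a class docstring as a single line."""
    # inspect.getdoc strips the common indentation from the docstring.
    doc = inspect.getdoc(cls) or ""
    # The summary is everything before the first blank line.
    first_paragraph = doc.split("\n\n", 1)[0]
    # Collapse any line wrapping inside the summary.
    return " ".join(first_paragraph.split())


class ExampleTask:
    """Process a single exposure.

    The extended description would go here, and would not appear
    in the task listing.
    """


summary = summary_sentence(ExampleTask)
```

The practical upshot is that a task or config only gets a useful entry in these listings if its class docstring starts with a short, self-contained summary.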

Look at the ProcessCcdTask documentation as an example of what task documentation can look like. All the subtasks and other configuration fields are automatically documented with this boilerplate:

Retargetable subtasks
=====================

.. lsst-task-config-subtasks:: lsst.pipe.tasks.processCcd.ProcessCcdTask

Configuration fields
====================

.. lsst-task-config-fields:: lsst.pipe.tasks.processCcd.ProcessCcdTask

There are a lot of places in the ProcessCcdTask documentation that would normally be links, but currently aren’t because the corresponding API reference page isn’t available yet.

Next, take a look at the AssembleCoaddTask documentation. Content-wise it isn’t complete, but there you can see how the Python API summary section is intended to work.

With the new task documentation, one of my design goals was to move task documentation out of docstrings. We’re doing this for a couple of reasons. First, it gives us a bit more flexibility than the numpydoc standard allows. With tasks, we’re documenting more than a single class, so using class docstrings the way we had been was a bit of a stretch. Second, tasks will be used by more than just our API user base. For example, users on the Science Platform may fire off tasks (thanks to PipelineTask) without using the Python API. Task topic pages cater to multiple needs, whether those of users of different PipelineTask activators, API users, or even general scientific documentation.

All this to say, the Python API summary section is designed as a bridge from a task topic page to the API reference page. It lets an API user quickly jump to the numpydoc-generated documentation so we can let that numpydoc documentation provide Python API-specific details like parameters, returns, and exceptions.

Documentation for task documentation

This new task documentation framework is documented in the “DM Stack” section of developer.lsst.io:

Those pages reference new templates in the github.com/lsst/templates repository. They are:

Lastly, there is some brief documentation about the new Sphinx extensions implemented in Documenteer. For example, there are lsst-task and lsst-config-field roles that link to a task or configuration field:

:lsst-task:`~lsst.pipe.tasks.processCcd.ProcessCcdTask`

:lsst-config-field:`lsst.pipe.tasks.processCcd.ProcessCcdConfig.isr`
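
The first example above uses a leading `~`. Assuming these roles follow the usual Sphinx cross-reference convention (where `~` shortens the displayed link text to the last component of the dotted name), the displayed text could be worked out roughly as in this sketch. The `link_text` helper is hypothetical and is not part of Documenteer.

```python
def link_text(target):
    """Return the text a role would display for a dotted target.

    Assumes the Sphinx convention that a leading ``~`` means
    "show only the last component of the name".
    """
    if target.startswith("~"):
        return target[1:].rsplit(".", 1)[-1]
    return target


short = link_text("~lsst.pipe.tasks.processCcd.ProcessCcdTask")
full = link_text("lsst.pipe.tasks.processCcd.ProcessCcdConfig.isr")
```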

Feedback requested

Over the next couple of weeks, I invite you to give your impressions of the current template for task documentation, and give suggestions for where we can go next with it.

Here are some starter questions I have:

  • Overall, does the template give you all the sections you need to document your tasks? Are there any common situations that could be incorporated into the template?
  • Do you like the Python API summary section? Does it strike the right balance between keeping the task topic page brief and useful, and letting the actual API reference page do its thing? For example, should the Python API summary section include parameters and return types for the class docstring and run method?
  • How can we document dynamic and obs package configuration overrides? I’m thinking of simply showing the code for the config class’s setDefaults method and the content from the corresponding modules in the obs package’s config directory.
  • Configurations of subtasks. Right now these task topics are designed for users to click through to subtasks to see their configuration fields. Should we include the configuration fields of (default) subtasks on the parent task’s page? Perhaps this requires a search box and JavaScript-enabled progressive disclosure.
  • Would it be useful for a task to automatically list all the known tasks that use it?

There’s also a well-known need for the following things, which I look forward to hearing your ideas about:

  • Testable, Jupyter-enabled examples.
  • Dataset documentation (ideally integrated with content in the codebase thanks to the Gen 3 middleware work).
  • Replacement of the command-line task documentation with activator documentation.

I’ll be unable to personally respond to questions and feedback for the next couple of months, but I look forward to reviewing the discussion soon. Thanks!

Looks great! I’ve got lots of comments below, but none of them should be considered blockers on rolling this out as broadly as possible and using it immediately; it’s already a big step in the right direction, and even more importantly I don’t think future changes (aside from those coming from PipelineTask) would substantially change the content doc writers will need to generate.

Nothing seems to be missing at a high level. In the Butler Dataset section, we should strongly recommend that doc authors always link to other Tasks that produce input datasets and ideally some that consume output datasets.

I would also try to group the Butler Datasets with other CmdLineTask-specific sections, and possibly have separate sections for command-line usage examples and in-python usage examples.

I think I’d expand it a bit to include (duplicate, I suppose) the “Methods summary” and “Attributes summary” tables from the API docs entirely.
I also think we need to find some way to let users switch back and forth quickly between the full API doc page content and the Task doc page; bidirectional sidebar links would be a step in the right direction, but I think I might prefer just having the full API docs on the same page, at the bottom.

I also think of the Config options as being siblings of a sort with the actual class attributes and methods, in terms of the level they should appear in the hierarchy; it would be nice for those to all follow a common pattern in terms of having a summary table with more details out-of-the-way (either behind a link or at the bottom of the page). Granted, most of our config option docs don’t actually have more than the one-line summary, but we should probably adopt a one-line+detailed pattern for config docs, and they do have some structured content (types, defaults) that would already fill out a dedicated details section.

Those bottom-of-the-page detailed-doc ideas would mean relying on the sidebar for navigation a bit more, and I should note that I found it a bit surprising that the hierarchy in the sidebar changes to be rooted at the current page, rather than picking out the location of the current page in the more global structure the way it does in the dev guide (I think this is common to all of pipelines.lsst.io, but I hadn’t noticed until now). I think we want the global hierarchy somewhere; my preference (which I’m happy to be talked out of) would be to have the global hierarchy in the current left-hand sidebar, with a per-page table of contents either in the intro (Wikipedia-style) or on the right-hand side.

If we could pull it off, a "Show defaults from: " dropdown list that would switch between the Task’s own defaults and the defaults set by a particular CmdLineTask parent and optional obs package would be really slick. (I think limiting the list to CmdLineTask parents is necessary to keep the list of options reasonably sized.) We’d then also want that to somehow highlight the options that were overridden, though; without that my slick idea is probably still less informative than your simpler one.
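
To make the “highlight what was overridden” part concrete, here is a minimal sketch using plain dictionaries rather than real config classes. The field names, the values, and the `apply_overrides` helper are all hypothetical; the point is only that layering overrides and reporting the differences from the Task’s own defaults is a small computation once the values are available.

```python
def apply_overrides(base, *layers):
    """Apply override layers in order and report which fields changed.

    Returns the merged values plus a mapping of
    field name -> (task default, final value) for every field whose
    final value differs from the task's own default.
    """
    merged = dict(base)
    for layer in layers:
        merged.update(layer)
    overridden = {
        name: (base.get(name), value)
        for name, value in merged.items()
        if name not in base or base[name] != value
    }
    return merged, overridden


# Hypothetical defaults for a subtask, a parent command-line task,
# and an obs package (not taken from any real config).
task_defaults = {"doWrite": True, "threshold": 5.0, "nIter": 3}
parent_overrides = {"threshold": 10.0}
obs_overrides = {"nIter": 2, "threshold": 10.0}

merged, overridden = apply_overrides(
    task_defaults, parent_overrides, obs_overrides
)
```

A field that a later layer sets back to the Task’s own default is not reported, which seems like the behavior a reader of the page would expect.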

I think simple links are probably best here; doing this recursively would totally explode the pages, and I don’t think doing it one level deep adds much convenience.

Another note on this is that a lot of those ConfigurableField subtasks are not in practice retargetable, because there is only one implementation for a slot in the codebase and we haven’t really defined what interface a replacement would need to have. We should probably indicate that on the code side by using ConfigField instead of ConfigurableField, so to first order this isn’t a doc problem, but we should make sure that the doc system does appropriately identify ConfigField subtasks as non-retargetable but nevertheless important if/when we do.

Yes, but it would be particularly useful to know the CmdLineTasks that indirectly use it, I think (and that would be necessary for my config defaults idea above).
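
A minimal sketch of the lookup this would need, assuming the doc build can produce a mapping from each task to its (default) subtasks. The task names and the `find_users` helper are hypothetical; the sketch inverts that mapping and walks upward to collect both direct and indirect users.

```python
def find_users(target, subtasks_of):
    """Return every task that uses ``target`` directly or via subtasks."""
    # Invert the parent -> subtasks mapping into subtask -> parents.
    parents_of = {}
    for parent, children in subtasks_of.items():
        for child in children:
            parents_of.setdefault(child, set()).add(parent)

    # Walk upward from the target, collecting each ancestor once.
    users = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for parent in parents_of.get(current, ()):
            if parent not in users:
                users.add(parent)
                stack.append(parent)
    return users


# Hypothetical task graph: parent task -> its subtasks.
subtasks_of = {
    "DriverTask": ["MiddleTask"],
    "MiddleTask": ["LeafTask"],
    "OtherTask": ["LeafTask"],
}

users = find_users("LeafTask", subtasks_of)
```

Filtering the result down to CmdLineTasks would then just be a matter of checking each returned task’s type.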

I don’t think this needs to be a top priority, but it would be something to try not to rule out architecturally.

“Sure”, I guess? I’m not opposed to this, but I’m not convinced we get more out of Jupyter examples than we do out of regular example snippets with doctest, so if the latter are easier to implement and run in CI, I’d definitely focus on those first.
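
For reference, checking a plain example snippet takes very little machinery with the standard-library doctest module. The snippet below is a generic placeholder, not an actual task example, and `run_example` is a hypothetical helper.

```python
import doctest

# A documentation snippet written in the interactive-session style
# that doctest understands.
EXAMPLE = """
>>> values = [3, 1, 2]
>>> sorted(values)
[1, 2, 3]
"""


def run_example(text):
    """Run one doctest-style snippet and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        text, globs={}, name="example", filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_example(EXAMPLE)
```

A CI job would only need to fail when `results.failed` is non-zero.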

Yup, I expect this to eventually replace the need for hand-written content in the Butler Datasets sections…but we also don’t actually have a plan yet for where to put the source of truth for Dataset documentation in Gen3, and this is worth thinking about. The trouble is that they’re sort of multiply-owned (by consumer Tasks, producer Tasks, and data repos), and pretty dynamic (i.e. created on first use) in general, despite the fact that there will be standardized ones that are used the vast majority of the time.