Documentation and links for Python-wrapped C++ code in Sphinx

jbosch · November 10, 2016, 5:02pm

As we start to write more restructured text documentation, I think we need to determine our policy (and perhaps add some tooling) for how to refer to code objects (classes, functions, etc) that exist in both C++ and Python, since Sphinx restructured text provides different syntaxes for those links.

Here’s a proposal to start us off:

When documenting Python code, always refer to other code objects using Python syntax (dotted namespaces) and :py: directives, even if those code objects are originally defined in C++.
When documenting C++ code (I’m not sure we’d actually ever do this if we continue to use Doxygen), refer to other code objects using C++ syntax (double-colon namespaces) and :cpp: directives.
When writing general documentation that isn’t language-specific, use the Python syntax and directives.

I also think we should have a custom directive for linking a C++ code object to its Python counterpart, and possibly try to make those connections automatically. It might also be nice to have the documentation for both languages on the same page, allowing text that’s relevant for both languages to appear only once. Last I heard, we were planning to go with Breathe to bridge Doxygen and Sphinx. If it already has a sane approach to this, it’d probably be best to just go with that instead of fighting it. @jsick, do you know if whatever it does is already consistent with my above recommendations?
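To make the "connect them automatically" idea concrete, here is a minimal sketch in Python. It assumes that every C++ namespace maps one-to-one onto a Python package of the same name, which may not hold for every wrapped object. The function names and the `lsst::afw::geom::Box2I` example are illustrative only; `:cpp:class:` and `:py:class:` are the standard Sphinx roles.

```python
def cpp_to_python_name(cpp_name: str) -> str:
    """Convert a C++ qualified name to the assumed Python dotted name."""
    # Assumption: namespaces and packages share names, e.g. lsst::afw -> lsst.afw.
    # Empty parts (from a leading "::") are dropped.
    return ".".join(part for part in cpp_name.split("::") if part)


def cross_reference(cpp_name: str, kind: str = "class") -> dict:
    """Build the reStructuredText role text for both language domains."""
    py_name = cpp_to_python_name(cpp_name)
    return {
        "cpp": f":cpp:{kind}:`{cpp_name}`",
        "py": f":py:{kind}:`{py_name}`",
    }


refs = cross_reference("lsst::afw::geom::Box2I")
print(refs["cpp"])
print(refs["py"])
```

A custom directive could emit both roles from a single name this way, but any object whose Python name differs from its C++ name would need an explicit override table.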

timj · November 10, 2016, 5:15pm

We have a scheme in place for referring to LSST python functions and methods using a modification of the :mod: and :class: syntax. They currently don’t go anywhere because they can only link to other rst files but we did start using it in tech notes and the developer guide (until @parejkoj started backing out the changes to make them link to doxygen docs).

parejkoj · November 10, 2016, 6:09pm

That was so people could find the docs for those methods. I’d love for the changes I made there to be reverted and have the links go to sphinx docs.

jbosch · November 10, 2016, 6:23pm

I think I’m aware of that functionality for links (they’re what I linked to in the original post, right?). What I’m trying to add here are recommendations for when to use the Python links vs. the C++ links when the code object exists in both languages.

timj · November 10, 2016, 6:29pm

I don’t think so. We define LSST variants such as :lmod: and :lclass: that will know how to find the LSST documentation.

jbosch · November 10, 2016, 6:52pm

Ah! So should we be using those instead of e.g. :py:class: now when writing numpydoc docstrings?

(If so, we should upgrade the dev guide.)

timj · November 10, 2016, 7:25pm

I’ll leave that one for @jsick to answer.

jsick · November 10, 2016, 7:51pm

This is right. “General documentation,” like conceptual topics and tutorials, should nearly always take a Python-centric perspective. This matches the LSST science community’s needs. I think we can treat C++ as an implementation detail of the Stack when we teach it through conceptual docs and tutorials (although we’ll make the C++ API known through reference documentation).

This is right. API reference documentation written in Python modules (also known as numpydoc-formatted docstrings) will only refer to other Python APIs and Python concepts. Types of parameters and return types will cross-reference to Python APIs. Example code in numpydoc docstrings is always Python code.

In other words, a person should be able to read the LSST Science Pipeline’s/Stack’s documentation and never be distracted by C++ details.

Now it turns out that we can take advantage of Sphinx’s ‘default’ domain. This means that in conceptual documentation, and certainly in numpydoc, we can treat the default domain as python. So instead of making a reference with

:py:mod:`lsst.afw`

we can simply write

`lsst.afw`

and Sphinx will make the appropriate cross-link. My validate_base package shows this, see https://github.com/lsst/validate_base/blob/tickets/DM-7935/python/lsst/validate/base/metric.py.

So… the “l” roles were my original approach to preparing crosslinks for code that’s not yet covered in the new Science Pipelines sphinx project. However what I’m finding lately in practice is that the default domain and default role (single backticks) is very effective. There’s no need to fully specify the role, like :py:class: anymore.
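For reference, the single-backtick behaviour depends on the Sphinx project's configuration. The fragment below is a hypothetical `conf.py` excerpt, not the actual LSST configuration; `primary_domain` and `default_role` are standard Sphinx settings, and the values shown are one plausible choice.

```python
# conf.py (fragment): hypothetical values, not the actual LSST configuration.

# Treat unqualified roles such as :class: as belonging to the Python domain.
# "py" is already Sphinx's default; it is spelled out here for clarity.
primary_domain = "py"

# Resolve text in single backticks, e.g. `lsst.afw`, as a reference to a
# Python object instead of the docutils default (a title reference).
default_role = "py:obj"
```

With settings like these, a target written in single backticks becomes a link once that object is documented in the same Sphinx project (or reachable through intersphinx).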

tl;dr I think the emerging recommendation is that interim API references should use the double-back tick inline code syntax. Then we’ll just convert that to single back ticks once the APIs are documented in Sphinx.

In other words, @parejkoj’s fine. I don’t think we need :lclass: anymore.

I also agree in principle, but in practice C++ API references will be written in doxygen and bridged into Sphinx with breathe. I think it’ll be rare for us to refer to C++ APIs in tutorials and conceptual documentation, but we can use :cpp: from reStructuredText for that.

Can you give me a scenario where this directive would be useful?

I’m not sure, but the quoted passage might be mixing up the authoring context and the generated output context. So I’ll backtrack and talk about both separately.

In terms of the output context, I’ll push back on putting the C++ API reference on the same HTML page as the Python API reference. I think this will be too confusing. It’ll be better to have separate Python and C++ API reference sections for each package. The API reference pages will include ‘see also’ cross-links that will allow someone to jump from the Python version of an API to the C++ version of the API. But again, I think we should strive to allow Python API users to be able to not see any mention of C++ in documentation. Note that this link between C++ and Python API reference pages will be custom work; I’m not sure I know how this will be done yet. I’ll especially have to study what pybind11 means for this.

Then there’s the authoring context. Here, I think you’re right: it’d be ideal to document a C++ API in the C++ header itself, and then transform that C++ reference automatically into a Python API reference. This is consistent with keeping API reference documentation where the code is, and eliminating duplicate content.

At the same time, I’m not entirely sure how or if this will work. Some challenges:

When transforming API reference links from the C++ Doxygen content into Python numpydoc content, it assumes there’s a 1:1 mapping from C++ API objects to Python API objects.
We’ll have to introduce syntax into Doxygen that allows us to write C++ and Python examples that only appear in the appropriate output context.
I think pybind11 gives us a lot of flexibility to reshape the Python API in a way that’s not visible from the C++ header and Doxygen. It might be inevitable that some Python API documentation may need to be hand-crafted at the pybind11 level.
Whatever we do, we need to be able to embed numpydoc docstrings into Python at runtime, not just in the Sphinx build, so that Python users can get inline documentation.

So I’m not giving up on the principle of documenting our C++ APIs only in C++ and automatically transforming that into Python… but, I’m not sure it’ll work and it may be impossible to do correctly. I’ll know more when I start actually working on this. I do know that ‘industry’ would just hand-craft API reference documentation for both contexts since this gives the best user experience.

kfindeisen · November 10, 2016, 8:16pm

[quote=“jsick, post:8, topic:1392, full:true”]
In terms of the output context, I’ll push back on putting the C++ API reference on the same HTML page as the Python API reference. I think this will be too confusing. It’ll be better to have separate Python and C++ API reference sections for each package. The API reference pages will include ‘see also’ cross-links that will allow someone to jump from the Python version of an API to the C++ version of the API. But again, I think we should strive to allow Python API users to be able to not see any mention of C++ in documentation. Note that this link between C++ and Python API reference pages will be custom work; I’m not sure I know how this will be done yet. I’ll especially have to study what pybind11 means for this.[/quote]
A presentation style I’ve seen in some cross-language libraries (e.g., Unity) is a drop-down, button, or other selector that lets the user say what language they want to see documentation in. I don’t think this is any more difficult than the original proposal (in that you still need to modify the Doxygen/Sphinx output somehow).

[quote=“jsick, post:8, topic:1392, full:true”]
I think pybind11 gives us a lot of flexibility to reshape the Python API in a way that’s not visible from the C++ header and Doxygen. It might be inevitable that some Python API documentation may need to be hand-crafted at the pybind11 level.[/quote]
While I agree this will be sometimes necessary, I think we do need a “default” way to automatically translate C++ to Python documentation in cases where the API is the same in both languages. Requiring developers to write and maintain documentation independently in both languages will likely lead to either less documentation or inconsistent docs.

jsick · November 10, 2016, 8:34pm

I agree, and I’ll re-iterate that our intention is to do this. My caution was simply to say I’m worried it may be non-trivial to do it well.

To add context, a theme I hear consistently when talking with other astronomers is that our C++ API really isn’t relevant. Many even say they don’t care if our C++ APIs are documented at all. This means that our documentation focus needs to be on the Python API, even if most of the code and development effort is in C++. So as we automatically translate API docs from C++ to Python, we need to be constantly asking ourselves whether the Python reference documentation experience is as good as the C++ one. If not, we’re not meeting our “customer’s” needs.

jbosch · November 10, 2016, 8:36pm

I’m okay with either this approach or lots of cross-links. I do think the button/tab for language selection is something I’ve seen more frequently.

I’m worried about this as well, but I have a hard time imagining how we do it without making all of our pybind11 source code into .in files that then get run through some kind of template engine. That would also be a source of pain (it confuses editors/linters and makes the code build dependent on the doc build).

I wonder if we could get by with instead making the C++ documentation dependent on the Python docstrings. We could write the main doc in the pybind11 files and insert it (or tagged blocks of it specifically requested in Doxygen) into the C++ documentation pages. That reduces the utility of the header files themselves as a source of C++ documentation, but @rhl has long argued that we shouldn’t use them for that anyway.

jsick · November 10, 2016, 8:46pm

This is an interesting idea! I like it because:

We only write reStructuredText/numpydoc content. People don’t have to think about doxygen’s syntax.
We eliminate doxygen from the build toolchain entirely. (I.e., the documentation translation now stays in reStructuredText/Sphinx; the translation is from a Python domain to a C++ domain in reStructuredText.)
Python docstrings are now the ‘native’ format for documentation, giving us a natural advantage for being a great experience. The C++ experience is compromised a bit, but this is a trade-off we can afford (I think).

The downside is that the docs are slightly further from the C++ code… I’d be curious to see what C++ developers think.

jbosch · November 10, 2016, 8:55pm

I actually wasn’t proposing we go that far. I was thinking we’d continue to use Doxygen, but add some special commands (probably parsed after running Doxygen to generate XML), to let me write a Doxygen block like this:

/**
 *  Do a thing.
 *
 *  $py:parameters$
 *
 *  $py:returns$
 *
 *  $py:description$
 */
int doThing(...) { ... };

…where all those $py:...$ macros would be replaced by some piece of the Python documentation. We might have to do some translation of types/links in the parameters and returns sections - but I think the documentation system would have all the information it needs to do that - and in cases where the differences are significant we can just rewrite that section instead of using a macro.
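As a rough illustration of the macro idea, here is a minimal Python sketch of the substitution step. It assumes the Python docstring is already available as a string and that each macro sits alone on its comment line. The helper names are hypothetical, the numpydoc parsing is deliberately simplistic, and the text is copied verbatim with no translation of types or links.

```python
import re

DOCSTRING = """\
Do a thing.

A longer description of the thing.

Parameters
----------
x : `int`
    Input value.

Returns
-------
result : `int`
    Output value.
"""

COMMENT = """\
/**
 *  Do a thing.
 *
 *  $py:parameters$
 *
 *  $py:returns$
 */"""

# A comment line that holds nothing but a $py:name$ macro.
MACRO_LINE = re.compile(r"^(?P<prefix>\s*\*\s*)\$py:(?P<name>\w+)\$\s*$")


def split_sections(docstring):
    """Split a numpydoc docstring into {section name: text} (simplified)."""
    lines = docstring.strip().splitlines()
    sections = {"brief": lines[0].strip()}
    current, buffer = "description", []
    i = 1
    while i < len(lines):
        line = lines[i]
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        # A section heading is a non-empty line underlined with dashes.
        if line.strip() and following and set(following) == {"-"}:
            sections[current] = "\n".join(buffer).strip()
            current, buffer = line.strip().lower(), []
            i += 2
            continue
        buffer.append(line)
        i += 1
    sections[current] = "\n".join(buffer).strip()
    return sections


def expand_macros(comment, sections):
    """Replace each macro line with the matching docstring section."""
    out = []
    for line in comment.splitlines():
        match = MACRO_LINE.match(line)
        if match is None or match["name"] not in sections:
            out.append(line)  # ordinary line or unknown macro: keep as-is
            continue
        for text in sections[match["name"]].splitlines():
            out.append((match["prefix"] + text).rstrip())
    return "\n".join(out)


print(expand_macros(COMMENT, split_sections(DOCSTRING)))
```

Unknown macros such as `$py:all$` are left untouched here; a real tool would also need to convert the numpydoc markup into Doxygen commands rather than copying it.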

kfindeisen · November 10, 2016, 9:32pm

I also think that might be going a little too far. There will be cases where the C++ and Python APIs are quite different (and even if the “customers” don’t care, anybody maintaining the code N years from now will), so there needs to be a way to have C++-only or Python-only documentation.

jsick · November 10, 2016, 9:41pm

You’re right, and this makes the doc infrastructure even easier to make (with less fragile translation code). I was a bit worried there’d be pushback on having this much boilerplate, but that doesn’t seem to be a problem.

I’ll prototype it out.

swinbank · November 10, 2016, 11:14pm

Among the primary goals of the documentation effort should be to support developers writing code during construction, and developers maintaining and expanding that code during operations. Even if nobody who isn’t drawing an LSST paycheque ever looks at the C++ API reference, it still needs to be top-notch.

jsick · November 11, 2016, 2:15am

Yup, I think @jbosch’s approach will let us essentially do this. It lets us relax the single-source doc requirement just enough that we can write a really good Python API reference in the pybind11 context. Then through the doxygen boilerplate and macros, there’s enough substrate to write additional C++ docs that make sense in the C++ context. This is a huge win over our previous plan of writing API documentation only in the C++ context, which I do have concerns would compromise the Python docs.

Sorry for creating concerns that we’ll have poor C++ docs!

RHL · November 11, 2016, 2:33pm

I don’t quite understand what you are proposing. Can we have a concrete (non-functional!) example of the C++ and python docs, and some idea of what the output would look like?

I’m also concerned about:

While I understand that there may not be a one-to-one mapping (especially with things like python v. C++ containers), I would expect that the API would almost always be essentially the same. The python one might be extended, or we might decide to use properties so goo.get/setFoo() always translate to goo.foo; but random differences that make the APIs more pythonic or nicer scare me.

What did you really have in mind (as opposed to my strawman)?
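For concreteness, the property convention mentioned above (goo.get/setFoo() becoming goo.foo) could be applied mechanically on the Python side. This is only a sketch: `Goo` is a pure-Python stand-in for a wrapped class, and `add_accessor_properties` is a hypothetical helper, not an existing LSST or pybind11 function.

```python
class Goo:
    """Stand-in for a wrapped C++ class with getFoo/setFoo accessors."""

    def __init__(self):
        self._foo = 0

    def getFoo(self):
        return self._foo

    def setFoo(self, value):
        self._foo = value


def add_accessor_properties(cls):
    """Add a property `foo` for every getFoo (and setFoo, if present)."""
    for name in list(vars(cls)):
        if not (name.startswith("get") and len(name) > 3):
            continue
        suffix = name[3:]                      # "Foo"
        prop_name = suffix[0].lower() + suffix[1:]  # "foo"
        getter = getattr(cls, name)
        setter = getattr(cls, "set" + suffix, None)  # read-only if missing
        setattr(cls, prop_name, property(getter, setter))
    return cls


add_accessor_properties(Goo)
goo = Goo()
goo.foo = 3
print(goo.getFoo())
```

Because the rule is uniform, the Python name can always be derived from the C++ name, which keeps documentation links between the two predictable.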

jbosch · November 11, 2016, 3:33pm

Here’s what I had in mind; I have no idea how much it matches @jsick’s vision and I don’t think either of us knows how difficult this will be to implement. This isn’t my dream scenario; it’s very much limited by what I think is feasible. Also note that this example is perhaps overzealous in its documentation - I think some of the function parameters here are actually self-descriptive enough that we could just skip documenting them entirely in the real world.

Anyhow, I’m imagining this C++ declaration:

/**
 *  $py:brief$
 *
 *  $py:description$
 */
class Box2I {
public:

    /**
     *  Construct a box from its minimum and maximum points.
     *
     *  @param[in] min  minimum pixel position (lower left), inclusive
     *  @param[in] max  maximum pixel position (upper right), inclusive
     */
    Box2I(Point2I const & min, Point2I const & max);

    /**
     *  Construct a box from its minimum point and dimensions.
     *
     *  @param[in] min  minimum pixel position (lower left), inclusive
     *  @param[in] dimensions  width and height of the box; max - min + (1, 1)
     */
    Box2I(Point2I const & min, Extent2I const & dimensions);

    /// $py:all$
    bool contains(int x, int y) const;

};

and this pybind11 wrapper definition:

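namespace py = pybind11;
using namespace py::literals;  // enables the "name"_a keyword-argument syntax

py::class_<Box2I>(module, "Box2I",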
        "A rectangular pixel region with integer bounds.\n"
        "\n"
        "Box2I should be used for rectangular regions that only contain\n"
        "complete pixels; use Box2D for boxes with fractional pixels.\n"
        "\n"
        "Parameters\n"
        "----------\n"
        "min : `Point2I`\n"
        "    minimum pixel position (lower left), inclusive\n"
        "max : `Point2I`\n"
        "    maximum pixel position (upper right), inclusive\n"
        "dimensions : `Extent2I`\n"
        "    width and height of the box; max - min + (1, 1). Only one\n"
        "    of 'max' and 'dimensions' should be present.\n")
    .def(py::init<Point2I const &, Point2I const &>(), "min"_a, "max"_a)
    .def(py::init<Point2I const &, Extent2I const &>(), "min"_a, "dimensions"_a)
    .def("contains", &Box2I::contains,
        "Test whether a point is within the box.\n"
        "\n"
        "Parameters\n"
        "----------\n"
        "x : `int`\n"
        "    column position of point to test.\n"
        "y : `int`\n"
        "    row position of point to test.\n"
        "\n"
        "Returns\n"
        "-------\n"
        "contained : `bool`\n"
        "    True if the point is within the box, False otherwise.\n",
        "x"_a, "y"_a)
    ;

would produce documentation equivalent to this C++ declaration with self-contained Doxygen:

/**
 *  A rectangular pixel region with integer bounds.
 *
 *  Box2I should be used for rectangular regions that only contain
 *  complete pixels; use Box2D for boxes with fractional pixels.
 */
class Box2I {
public:

    /**
     *  Construct a box from its minimum and maximum points.
     *
     *  @param[in] min  minimum pixel position (lower left), inclusive
     *  @param[in] max  maximum pixel position (upper right), inclusive
     */
    Box2I(Point2I const & min, Point2I const & max);

    /**
     *  Construct a box from its minimum point and dimensions.
     *
     *  @param[in] min  minimum pixel position (lower left), inclusive
     *  @param[in] dimensions  width and height of the box; max - min + (1, 1)
     */
    Box2I(Point2I const & min, Extent2I const & dimensions);

    /**
     *  Test whether a point is within the box.
     *
     *  @param[in] x   column position of point to test.
     *  @param[in] y   row position of point to test.
     *
     *  @return true if the point is within the box, false otherwise.
     */
    bool contains(int x, int y) const;

};

A few more notes:

Because the numpydoc convention documents the constructor in the class docstring rather than in a separate __init__ docstring, the constructor parameters have to go in the class docstring.
Because the overloaded constructors look like a single method in Python, we need to document all of the parameters together, and hence we can’t really use the Python documentation to generate the C++ documentation for the constructors.
For the contains method, there’s no overloading and the parameters are the same, so we can essentially use the Python documentation as-is (with perhaps a translation from e.g. True to true).
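The translation for a simple method like contains could be sketched in Python as follows. This assumes every parameter is an input (`@param[in]`), handles only the Parameters and Returns sections, and the function names are hypothetical. The docstring is the one from the wrapper example above.

```python
import re

DOCSTRING = (
    "Test whether a point is within the box.\n"
    "\n"
    "Parameters\n"
    "----------\n"
    "x : `int`\n"
    "    column position of point to test.\n"
    "y : `int`\n"
    "    row position of point to test.\n"
    "\n"
    "Returns\n"
    "-------\n"
    "contained : `bool`\n"
    "    True if the point is within the box, False otherwise.\n"
)

TAGS = ("Parameters", "Returns")
ENTRY = re.compile(r"^(\w+) : ")  # unindented "name : type" line
LITERALS = {"True": "true", "False": "false"}


def to_cpp_literals(text):
    """Rewrite Python boolean literals in their C++ spelling."""
    return re.sub(r"\b(True|False)\b", lambda m: LITERALS[m.group(1)], text)


def numpydoc_to_doxygen(docstring):
    """Translate a simple numpydoc docstring into Doxygen comment text."""
    lines = docstring.strip().splitlines()
    out = []
    section = None  # current section, or None before the first heading
    pending = None  # entry name whose description has not been seen yet
    for index, line in enumerate(lines):
        stripped = line.strip()
        following = lines[index + 1].strip() if index + 1 < len(lines) else ""
        if stripped in TAGS and following and set(following) == {"-"}:
            section, pending = stripped, None
            continue
        if stripped and set(stripped) == {"-"}:
            continue  # underline of a section heading
        if section is None:
            out.append(stripped)  # brief and description copied unchanged
            continue
        if not stripped:
            continue
        entry = ENTRY.match(line)
        if entry:
            pending = entry.group(1)
            continue
        text = to_cpp_literals(stripped)
        if pending is None:
            out[-1] += " " + text  # continuation of the previous description
        elif section == "Parameters":
            out.append(f"@param[in] {pending}  {text}")
        else:
            out.append(f"@return {text}")
        pending = None
    return "\n".join(out)


print(numpydoc_to_doxygen(DOCSTRING))
```

The overloaded constructors are exactly the case this cannot handle, since one merged Python parameter list has to be split back into per-overload C++ comments.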

What you brought up (containers, extensions in Python) covers most of what I had in mind, though I think there are a lot of things in the “extension” category we could consider (not all of which will necessarily be worth our time, as nice as they may be). The other big difference that comes to mind for me is using dtype arguments instead of “F”, “D”, etc. suffixes on class names for wrapped C++ templates.
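The dtype idea could look something like the following on the Python side. This is a sketch only: `ImageF` and `ImageD` are pure-Python stand-ins for two wrapped instantiations of one C++ template, the `Image` factory is hypothetical, and dtypes are plain strings to keep the example dependency-free (a real version would presumably accept NumPy dtypes).

```python
class ImageF:
    """Stand-in for a wrapped C++ template instantiated for float."""

    def __init__(self, width, height):
        self.width, self.height = width, height


class ImageD:
    """Stand-in for a wrapped C++ template instantiated for double."""

    def __init__(self, width, height):
        self.width, self.height = width, height


# Map each supported dtype name to the matching suffixed class.
_IMAGE_TYPES = {"float32": ImageF, "float64": ImageD}


def Image(width, height, dtype="float32"):
    """Construct the instantiation selected by ``dtype``."""
    try:
        cls = _IMAGE_TYPES[dtype]
    except KeyError:
        raise TypeError(f"no Image instantiation for dtype {dtype!r}") from None
    return cls(width, height)


image = Image(4, 3, dtype="float64")
print(type(image).__name__)
```

A factory like this gives Python users a single documented entry point, at the cost of a Python API that no longer mirrors the C++ class names one-to-one.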