Expanding and documenting Pipeline definitions

Following up on RFC-775

On RFC-775, @natelust and I proposed - without a lot of detail - having the new drp_pipe package separate Pipeline source files (in recipes and ingredients directories) from their expanded forms (in a pipelines directory), which would be both more readable and more suitable for actual execution. Despite being built by the build system, the expanded forms in the pipelines directory would be committed to git, because that enables a few very nice things:

  • Users could inspect the pipelines in their most readable form via the GitHub web interface.

  • Production operators could directly execute (via ButlerURI and GitHub’s raw-file access URIs) expanded, fully-configured pipelines that are as protected as possible from accidental configuration overrides (especially if we bake a multi-package software version hash into those files).

  • The task and configuration changelog of each pipeline would naturally appear as the git history of those expanded files.

Committing build artifacts to version control is always a bit problematic, but those upsides were enough that I accepted the RFC without a detailed plan, wanting to experiment a bit in the implementation to see if we could make it work.

I’ve now done that experimentation (on DM-30891), and I think it’s time to give up on the idea, at least in the form proposed on the RFC. Big thanks to @kfindeisen and @mrawls for helpful feedback on the ticket that saved me from going down a few more dead-ends before arriving at that conclusion.

What that means for drp_pipe is that we’ll remove the recipes directory and put its pipeline “source” definitions in the pipelines directory. For at least most development purposes, those (non-expanded) pipelines are what we’ll run - expansion will happen on-the-fly, as it does with our obs_* package pipelines today. That’s also good because it’s basically what ap_pipe already does - though I do want to keep the ingredients directory in drp_pipe as a place to put pipeline source content that should not be directly run on its own, and I think it’s fine if ap_pipe doesn’t ever need to have ingredients; this may just reflect the fact that the DRP pipeline has many more tasks.

That said, those motivations for git-committing the expanded pipelines still stand, and in a follow-up discussion, @natelust and I came up with some ideas for achieving them in other ways.

Making expanded pipelines into docs

The expanded pipeline files really are more readable, and they will become (relatively) even more readable as the pipeline source files are refactored in the future to remove duplication. But if the goal is to let humans read them on the web, we don’t need to use GitHub: we can get them into our Sphinx doc builds instead. This will require some tooling support, but I can imagine all kinds of wonderful interactive navigation that I am not remotely capable of implementing myself:

  • variants of the same pipeline (e.g. for different instruments or test datasets) in tabs;
  • expanding/collapsing blocks for the configuration associated with each task (maybe even with differences from the task-level defaults highlighted somehow);
  • views of the pipeline as a graph (something we can already generate via GraphViz);
  • links to the schemas of catalogs produced by these tasks.

There are a fair number of very long-standing tickets requesting pieces of this, and I think we can do a number of them at once if we start by running pipeline expansion inside the pipelines.lsst.io build and then extract things from it into rST (or even directly into HTML).

@jsick , perhaps we could get this started with a brief meeting on possibilities and technologies? If this is mostly writing some kind of Sphinx extension in Python, I think we can find effort in Science Pipelines to do a lot of the work; if it requires HTML or Javascript, we may need to seek more help from other parts of DM (though I suspect there are people in Science Pipelines with that kind of expertise, too - I just don’t know who, other than “not me”).
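
To make the “some kind of Sphinx extension” idea a bit more concrete, here’s a minimal sketch of the sort of directive I have in mind. The pipetask build --show pipeline invocation is the one we already use today, but the directive name and everything else about it is hypothetical, and a real implementation would probably call the Python API directly and cache results rather than shelling out:

import subprocess

from docutils import nodes
from docutils.parsers.rst import Directive


class ExpandedPipeline(Directive):
    """Render the expanded form of a pipeline source file as a literal block."""

    required_arguments = 1  # path to the pipeline source YAML

    def run(self):
        # Expand imports (and eventually configs) at doc-build time.
        expanded = subprocess.run(
            ["pipetask", "build", "-p", self.arguments[0], "--show", "pipeline"],
            capture_output=True, text=True, check=True,
        ).stdout
        literal = nodes.literal_block(expanded, expanded)
        literal["language"] = "yaml"
        return [literal]


def setup(app):
    app.add_directive("expanded-pipeline", ExpandedPipeline)
    return {"parallel_read_safe": True}

The fancier navigation (tabs for variants, collapsing config blocks, graph views) would presumably grow out of something like this, replacing the plain literal block with richer nodes.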

The Pipelines changelog as a separate git repository

The big problem with committing expanded pipelines to drp_pipe is that the content of those pipelines depends on all of drp_pipe's dependencies. This means a change in one of those upstream packages can easily break the equivalency between the pipeline source we’d planned to put in recipes and the expanded forms in pipelines, without any change (or opportunity to commit) to drp_pipe.

A separate git repository for expanded pipelines whose commits are only machine generated - say, from the same services that produce daily and weekly builds - would not suffer from this problem, however, and it opens up some new possibilities:

In addition to expanded pipelines, we could also record (for each commit, mapping to a particular release of the Science Pipelines):

  • the release tag for the stack and/or the git commit refs of all dependencies (maybe even via git submodules);
  • the Jira ticket numbers merged since the last commit to this “changelog” repo (extracted by parsing git commit logs);
  • schema files and other init-output datasets produced by those expanded pipelines.
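
For concreteness, here’s a rough sketch of how such a machine-generated commit might be produced, say from the same job that builds a weekly. The pipetask, eups, and git invocations are ones I believe exist today; the changelog-repo path, the set of top-level pipelines, and the release-tag argument are all just placeholders:

import os
import subprocess
from pathlib import Path

CHANGELOG_REPO = Path("/path/to/pipelines_changelog")  # hypothetical local clone
TOP_LEVEL_PIPELINES = {
    # hypothetical mapping from changelog-repo filename to pipeline source
    "DRP-RC2": "$DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml",
}


def commit_expanded_pipelines(release_tag: str) -> None:
    for name, source in TOP_LEVEL_PIPELINES.items():
        # Write the fully-expanded pipeline into the changelog repo.
        subprocess.run(
            ["pipetask", "build", "-p", os.path.expandvars(source),
             "--save-pipeline", str(CHANGELOG_REPO / f"{name}.yaml")],
            check=True,
        )
    # Record the software versions the expansion was done with.
    versions = subprocess.run(
        ["eups", "list", "--setup"], capture_output=True, text=True, check=True,
    ).stdout
    (CHANGELOG_REPO / "versions.txt").write_text(versions)
    subprocess.run(["git", "-C", str(CHANGELOG_REPO), "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", str(CHANGELOG_REPO), "commit", "-m",
         f"Update expanded pipelines for {release_tag}."],
        check=True,
    )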

That would make this a souped-up version of the already super-useful informal changelog:

  • you could git clone it locally and use whatever tools you like to inspect/explore the history (git bisect!);
  • you could use it to directly relate pipeline content and configuration changes to Jira tickets and versions;
  • with a bit of extra tooling, you could use it to set up, install, or build a pipelines version associated with a particular changelog commit.

It may also be that some pretty simple web pages backed by this repo would be a better way to display pipeline content than integration with the Sphinx doc build - to the extent that, if we can get this stood up quickly instead, we may not want to bother with the doc integration from the last section at all (and would just link to it from the regular docs, of course).

There may also be fun things one could do with connections to SQuaSH (any metric we upload should probably come from a pipeline that’s exactly represented in this changelog repo) or build-system experimentation (a submodule view of the packages seems like a nice thing to hang a monolithic, eups-free CMake build on…), but now I’m getting ahead of myself.

Anyhow, I’m not quite sure who to talk to about getting something like this up and running, but I bet somebody who reads this will have some ideas. I should also add that my main concern with this idea is that it may partially duplicate a lot of things that already exist that I don’t know much about (e.g. the lsstsw versiondb repo, the new schema browser). Please chime in if you see this as conflicting with existing ways of doing things.

This is a tricky one, thank you for working through it. I’m glad to forego recipes and just have pipelines and (if desired) ingredients, the former of which necessarily has a bunch of imports but is still mostly-human-readable.

Re. “Making expanded pipelines into docs,” if ap_pipe/pipelines is basically fine as-is, I think the most useful addition in the docs department would be an auto-generated figure, which is one of your bullet points, i.e.,

pipetask build -p MyPipeline.yaml --pipeline-dot MyPipeline.dot
dot MyPipeline.dot -Tpng > MyPipeline.png

for the full pipeline. Even for the relatively simple ap_pipe pipelines, I have to stare at 2+ yaml files to figure out what a pipeline is actually doing in its entirety, and while only staring at 1 yaml file (or 1 Sphinx page) would be an improvement, an up-to-date pipeline flowchart diagram would really enhance clarity.

Frankly, I bet it would get people to actually bother to navigate to our docs.

Re. “The Pipelines changelog as a separate git repository,” I don’t fully understand the complex dependencies you refer to for drp_pipe, but I’m not opposed to a separate git repository as an independent landing place for fully-expanded pipelines.


I don’t completely follow why the possibility of upstream changes (I assume in obs_* configs or task defaults) is a fatal flaw in the original proposal, but if the motivation is for pipeline users to be able to grasp the status quo rather than for developers to be able to see the consequences of their changes, then there’s no particular reason to put the expanded pipeline in *_pipe.

However, I’m a bit concerned that both of these proposals add a lot of bells and whistles that could distract from the original goal of “grasp the entire pipeline at once”. Specifically,

  • For the docs proposal, I’m assuming this would be modeled on our existing infrastructure for task and script pages. Would it be possible to start with just a pipeline dump and an image, as @mrawls proposed? Things like trying to work out which pipelines are “variants”, or schema management seem like they could add a lot of complexity that would make the system harder to use. And for really complex pipelines like DRP, just making the diagram readable is already a challenge. :slight_smile:
  • For the separate git repository, I honestly don’t think the commit/ticket/schema metadata will be very useful unless the repository is built on every merge. We’ve been doing nightly ap_verify runs for a while now, and trying to figure out what caused a given change is always difficult because Science Pipelines sees half a dozen tickets merged on a slow day – and if the nightly fails, which it does reasonably often, the gap just gets bigger. So you’d either have to invest in dedicated infrastructure, or scale back expectations for what the repository can tell you.

P.S. For the record, ap_pipe keeps unexpanded pipelines in pipelines not as any sort of ideal, but because the pipeline compilation system was not yet available, and for forward- and backward-compatibility we needed to always have usable pipelines in pipelines. I fully expected to migrate to the ingredients/recipes system once it was ready.

Also, just to make sure we’re on the same page (something I’m less confident about recently), how expanded is an expanded pipeline?

The current pipetask build --show pipeline idiom resolves imports but makes no attempt to reconcile multiple layers of config changes (especially Python config files, whose content may not be representable as YAML). Is this what people would be seeing, or would they get a “net” pipeline that’s been simplified as much as possible (e.g., mentions a particular config field at most once)?

We’ve brought up doing a generation 3 middleware overhaul of the doc infrastructure to encompass Butler datasets and Pipelines, but that hasn’t made it into our cycle planning and epics yet, so sorry about the fair number of very long-standing tickets.

We can definitely chat about this and explore some options. SQuaRE has a standing co-work session on Thursday afternoons and you’re welcome to book a slot with Frossie. We can also get on a video call any other time before the end-of-year break.

There are a number of ways of approaching this, from static pre-computation and rendering in the Sphinx docs all the way up to a fully-interactive web application hosted on the Rubin Science Platform that can dynamically adapt to different real configurations and datasets. There are a bunch of advantages and disadvantages to each of those approaches.

Keeping in mind that we don’t have much exposure to Gen 3 middleware/Pipelines in SQuaRE’s day-to-day work, one thing that might be really beneficial for me before we meet is for you to develop and gather some background material and design goals/expectations:

  • A prototype of how you want the docs to work (i.e. a simple wireframe sketch or even a Google Docs mock-up of what the doc page or UI should contain)
  • What interactivity you expect to need (what dimensionality do the docs need to have to be useful?)
  • A description of the input parameters and computation needed to expand the pipeline source files to get the data that’s needed to render the documentation.

I’m worried that the “pipelines changelog repo” is actually a camel’s nose for a submodule-based monorepo.

Is it for drp_pipe only, or ap_pipe as well, or do they each have a separate one (and are there any others)?

Every release build (including weekly and nightly) already has an associated Web-published eups manifest including git commit refs for all dependencies.

I’m still not sure I see why build artifacts should be committed to GitHub, though.

I’m not at all sure how much this will have in common with the task and script stuff, but starting with a pipeline dump and SVG graph is exactly what I had intended to do, and pretty much everything I wanted to do with the docs beyond that amounts to navigational aids (GUI elements and links) for that content. Displaying schemas does go a bit further in terms of content, but it’s still a dump of things derivable fully from the pipeline with a viewer sitting on top of that. For “variants”, all I had in mind was the different top-level pipelines in drp_pipe - I just wanted an easy way to flip back and forth between (or view a diff of) e.g. the HSC RC2 vs. DP0.2 pipelines in expanded form.

Interesting - I was assuming that the daily granularity of the existing changelog website, along with its popularity, meant that daily was probably enough - but that may be the difference between debugging ap_verify failures (or ~equivalently ci_hsc failures) vs. debugging regressions discovered in SQuaSH metrics or monthly processing runs, which is (I gather) where the current changelog really shines. In any case, from a pure size-of-repository standpoint, every merge seems feasible and certainly more useful, and you make a very good point that relying on a successful nightly build to publish something like this would not be ideal (it may even be a regression from the current changelog).

Sorry I skipped a few steps on this from where we left off on the DM-30891 Jira discussion; I think it’s worth filling in how my thinking went in between, because “how much we expand” is closely related to that fatal flaw.

  • The main motivation of expanding the pipelines and committing them to git was for users to get a view of both the status quo and the changes in the pipeline over time (I’m sort of channeling @natelust here, as these were his ideas originally - he just sold me on them). This expansion would include applying all config overrides and writing out complete config files, not just resolving pipeline imports, basically because the configurations are a lot of what we’d want to track and view.

  • When I started trying to implement that, it immediately became clear that we had to guard against the expanded pipelines being out of sync with their recipes; it’d be a huge problem if a user could modify a recipe, ingredient, or config file, build and push their changes, and have that change be completely ignored when something is run from pipelines/ because there were inadequate consistency checks between the git-committed pipelines/ content and what the recipes generate.

  • On the other hand, it was also important that superficial or pure-refactoring changes to recipes, ingredients, or configs (i.e. changes that didn’t actually affect what would be run) should not be reflected in changes to the expanded pipelines, or diffs between them would mostly be noise. This also requires pipeline expansion to extend down to configs; if it only expanded imports, something as simple as moving a config override from an obs_* package to a pipeline ingredient file would trigger a spurious change in the expanded pipeline.

  • If we expand pipelines all the way down to configs, then pipelines must be re-expanded when task defaults or obs_* packages change, not just when ingredients or recipes in *_pipe change. And that rules out using any kind of git commit hook to automatically update the expanded pipelines in git, because such a hook would need to fire on commits made to repos other than the *_pipe one we are committing to.

  • If we can’t automatically expand and commit the pipeline, then adding something to the build system to check for consistency (either by overwriting the committed files and leaving the git repo dirty, or via more test-like checks) seems necessary. That’s a potentially big annoyance for users, but I figured that if we could make it also serve as a way to make sure pure-refactoring changes didn’t actually affect the pipeline, we might come out ahead overall; there’d be new pain, but some new gain as well.

  • Even that idea fell apart once I thought a bit more about developers having to rebuild *_pipe packages every time their dependencies changed, or running out-of-sync pipelines from a clone that had not been built since setting up a new version of a dependency. And @mrawls’ skepticism about the gain being worth the pain was probably well-founded anyway.

Sounds perfect; I’ll probably follow up after the holidays, as I don’t think it’s worth trying to get this ball rolling beforehand (I put this post up earlier to sketch out an alternative primarily because it means not doing the old thing now, not because standing up the new thing is at the top of my priority list).

:+1:. I’ll put something like that together before I show up some Thursday. But I think the last bullet is perhaps the most important one, and it’s something I can answer now - I basically need to run a function that takes no parameters in a high-level stack package that depends on many others; I can easily write such a function that dumps all of the content to a directory, and as a few others have noted, just getting that directory structure and the (text+SVG) files within it navigable would be a huge first step.
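
Something like this minimal sketch is what I have in mind - assuming the pipetask and dot options behave as I remember, and with the output layout and the way top-level pipelines are discovered just placeholders:

import os
import subprocess
from pathlib import Path


def dump_expanded_pipelines(output_dir: str = "expanded_pipelines") -> None:
    """Expand every top-level drp_pipe pipeline into output_dir as YAML plus
    an SVG graph, ready for the doc build to pick up."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    pipelines_dir = Path(os.environ["DRP_PIPE_DIR"]) / "pipelines"
    for source in pipelines_dir.rglob("*.yaml"):
        name = source.stem  # placeholder; would need instrument subdirs in practice
        dot_file = out / f"{name}.dot"
        # Resolve imports and configs, and emit a GraphViz graph alongside.
        subprocess.run(
            ["pipetask", "build", "-p", str(source),
             "--save-pipeline", str(out / f"{name}.yaml"),
             "--pipeline-dot", str(dot_file)],
            check=True,
        )
        # Render the graph as SVG for the docs.
        subprocess.run(
            ["dot", "-Tsvg", str(dot_file), "-o", str(out / f"{name}.svg")],
            check=True,
        )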

I’m not going to pretend that I don’t find the idea intriguing, but I’m by no means sold on it as a way to build our code (it’s something I might want to experiment with given more free time than I actually have, but that’s it).

I do, however, think it has substantial advantages over anything we have now or would want to build ourselves as a way to browse/view/diff our code over time; it plugs our multi-package history into a huge ecosystem of tooling that we are for the most part super familiar with (for single-package history).

To be clear, I also think the variant where we commit the expanded pipeline files to this repo and save the associated package versions there in some other way (e.g. EUPS manifests or table files) is super useful. But if we did want to ever commit unreleased versions, I think we’d need to support saving raw git commit refs for dependency repos, and saving those to our own text file when there’s a built-in way to do that would feel backwards.

I think the best answer to this is that it’s a workaround for not having a build system strapped onto the submodule monorepo - instead of being able to build all artifacts from the submodule monorepo while e.g. bisecting, we dump a few particular important build artifacts (the expanded pipelines) directly into that repo from the build systems we already have.

Put another way: this is a step in the direction of submodule-monorepo builds, but it’s an incremental step with less effort and fewer risks (we’re not trying to put together a new system for building binary artifacts for multiple platforms) than going all the way there at once, and by committing a few build artifacts it brings substantial rewards even if it’s the only such step we ever take.

This is where this idea runs up against some hard choices, and I’ll admit I haven’t made any; the big problem I have with submodule-monorepo is how to have (or live without) multiple top-level packages, and that’s a problem even in this more limited form. I’m certainly open to opinions on how to approach this question, if we pursue the idea at all.

On that note, I’ll admit that my enthusiasm for this changelog-git-repo idea has dimmed a bit after thinking more about how much work it would be (prompted by prodding on that point in various comments here). I still think it’s at least a neat idea for something that would be very useful - I’m just less sure it’s worth the effort, especially if we do set up nice snapshots of the current expanded pipelines in the doc builds. Maybe something to bring up again someday if/when we hire a build engineer, especially if it’s somebody who wants to go in the direction of submodule-monorepo anyway and can make a good case for it.