Following up on RFC-775
On RFC-775, @natelust and I proposed - without a lot of detail - having the new drp_pipe package separate Pipeline source files (in recipes and ingredients directories) from their expanded forms (in a pipelines directory), with the expanded forms being both more readable and more suitable for actual execution. Despite being built by the build system, the expanded forms in the pipelines directory would be committed to git, because that enables a few very nice things:
- Users could inspect the pipelines in their most readable form via the GitHub web interface.
- Production operators could directly execute (via ButlerURI and GitHub’s raw-file access URIs) expanded, fully-configured pipelines that are as protected as possible from accidental configuration overrides (especially if we bake a multi-package software version hash into those files).
- The task and configuration changelog of each pipeline would naturally appear as the git history of those expanded files.
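As a rough illustration of the "multi-package software version hash" idea, here is a minimal sketch that folds a set of package versions into one short fingerprint. The function name, the package names, and the version strings are all hypothetical, and this is not how any existing LSST tool computes versions; it only shows that such a hash can be made deterministic and sensitive to any upstream change.

```python
import hashlib


def version_fingerprint(package_versions: dict[str, str]) -> str:
    """Return a short hex digest summarizing a set of package versions."""
    digest = hashlib.sha256()
    # Sort by package name so the result does not depend on dict ordering.
    for name, version in sorted(package_versions.items()):
        digest.update(f"{name}={version}\n".encode("utf-8"))
    # Truncate for readability; 12 hex characters is an arbitrary choice.
    return digest.hexdigest()[:12]


# Hypothetical package names and version strings, for illustration only.
versions = {"drp_pipe": "g1a2b3c4", "some_upstream_package": "g5d6e7f8"}
fingerprint = version_fingerprint(versions)
print(fingerprint)
```

A fingerprint like this could be written into the expanded pipeline file, so that a mismatch with the running software is detectable at execution time.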
Committing build artifacts to version control is always a bit problematic, but those upsides led me to accept the RFC without a detailed plan: I wanted to experiment a bit in the implementation to see if we could make it work.
I’ve now done that experimentation (on DM-30891), and I think it’s time to give up on the idea, at least in the form proposed on the RFC. Big thanks to @kfindeisen and @mrawls for helpful feedback on the ticket that saved me from going down a few more dead-ends before arriving at that conclusion.
What that means for drp_pipe is that we’ll remove the recipes directory and put its pipeline “source” definitions in the pipelines directory. For at least most development purposes, those (non-expanded) pipelines are what we’ll run - expansion will happen on-the-fly, as it does with our obs_* package pipelines today. That’s also good because it’s basically what ap_pipe already does - though I do want to keep the ingredients directory in drp_pipe as a place to put pipeline source content that should not be run directly on its own, and I think it’s fine if ap_pipe doesn’t ever need to have ingredients; this may just reflect the fact that the DRP pipeline has many more tasks.
That said, those motivations for git-committing the expanded pipelines still stand, and in a follow-up discussion, @natelust and I came up with some ideas for achieving them in other ways.
Making expanded pipelines into docs
The expanded pipeline files really are more readable, and they will get more (relatively) readable as the pipeline source files are refactored in the future to remove duplication. But if the goal is to let humans read them on the web, we don’t need to use GitHub: we can get them into our Sphinx doc builds instead. This will require some tooling support, but I can imagine all kinds of wonderful interactive navigation that I am not remotely capable of implementing myself:
- variants of the same pipeline (e.g. for different instruments or test datasets) in tabs;
- expanding/collapsing blocks for the configuration associated with each task (maybe even with differences from the task-level defaults highlighted somehow);
- views of the pipeline as a graph (something we can already generate via GraphViz);
- links to the schemas of catalogs produced by these tasks.
There are a fair number of very long-standing tickets requesting pieces of this, and I think we can do a number of them at once if we can start by running pipeline expansion inside the pipelines.lsst.io build and start extracting things from it into rST (or even directly into HTML).
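To make the "extract into rST" step a little more concrete, here is a minimal sketch that renders an already-expanded pipeline into rST sections, one per task. It assumes the expanded pipeline has been loaded into a plain dict of task labels mapping to a class name and optional config overrides; that layout, the function name, and the example task names are hypothetical stand-ins, not the actual pipeline file format or any existing Sphinx extension.

```python
def pipeline_to_rst(description: str, tasks: dict[str, dict]) -> str:
    """Render a pipeline description and its tasks as an rST fragment."""
    # Top-level title with an underline of matching length.
    lines = [description, "=" * len(description), ""]
    for label, task in tasks.items():
        # One subsection per task label.
        lines += [label, "-" * len(label), ""]
        lines += [f"Task class: ``{task['class']}``", ""]
        config = task.get("config", {})
        if config:
            # Show config overrides as a literal code block.
            lines += [".. code-block:: python", ""]
            lines += [
                f"   config.{key} = {value!r}"
                for key, value in sorted(config.items())
            ]
            lines.append("")
    return "\n".join(lines)


# Hypothetical pipeline content, for illustration only.
example_tasks = {
    "firstTask": {
        "class": "example.tasks.FirstTask",
        "config": {"threshold": 5.0},
    },
    "secondTask": {"class": "example.tasks.SecondTask"},
}
rst = pipeline_to_rst("Example pipeline", example_tasks)
print(rst)
```

Tabs, collapsible blocks, and graph views would need real Sphinx tooling on top of something like this; the sketch only covers the plain text extraction.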
@jsick, perhaps we could get this started with a brief meeting on possibilities and technologies? If this is mostly writing some kind of Sphinx extension in Python, I think we can find effort in Science Pipelines to do a lot of the work; if it requires HTML or JavaScript, we may need to seek more help from other parts of DM (though I suspect there are people in Science Pipelines with that kind of expertise, too - I just don’t know who, other than “not me”).
The Pipelines changelog as a separate git repository
The big problem with committing expanded pipelines to drp_pipe is that the content of those pipelines depends on all of drp_pipe's dependencies. This means a change in one of those upstream packages can easily break the equivalency between the pipeline source we’d planned to put in recipes and the expanded forms in pipelines, without any change (or opportunity to commit) to drp_pipe.
A separate git repository for expanded pipelines whose commits are only machine generated - say, from the same services that produce daily and weekly builds - would not suffer from this problem, however, and it opens up some new possibilities:
In addition to expanded pipelines, we could also record (for each commit, mapping to a particular release of the Science Pipelines):
- the release tag for the stack and/or the git commit refs of all dependencies (maybe even via git submodules);
- the Jira ticket numbers merged since the last commit to this “changelog” repo (extracted by parsing git commit logs);
- schema files and other init-output datasets produced by those expanded pipelines.
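The ticket-number extraction in the list above could be sketched roughly as follows. This assumes that Jira keys of the form DM-NNNNN appear somewhere in commit or merge messages; the function name and the sample log text (including DM-12345) are made up for illustration, and a real service would feed in the output of git log between two changelog commits instead.

```python
import re

# Assumed ticket key format: "DM-" followed by digits.
TICKET_PATTERN = re.compile(r"\bDM-\d+\b")


def tickets_in_log(log_text: str) -> list[str]:
    """Return the unique ticket keys in a git log, in numeric order."""
    found = set(TICKET_PATTERN.findall(log_text))
    return sorted(found, key=lambda key: int(key.split("-")[1]))


# Hypothetical log excerpt, for illustration only.
sample_log = """\
Merge branch 'tickets/DM-30891'
Move pipeline definitions (DM-30891)
Merge branch 'tickets/DM-12345'
"""
tickets = tickets_in_log(sample_log)
print(tickets)
```

Deduplicating matters here because one ticket typically shows up in several commit messages plus the merge commit.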
That would make this a souped-up version of the already super-useful informal changelog:
- you could git clone it locally and use whatever tools you like to inspect/explore the history (git bisect!);
- you could use it to directly relate pipeline content and configuration changes to Jira tickets and versions;
- with a bit of extra tooling, you could use it to setup, install, or build a pipelines version associated with a particular changelog commit.
Some pretty simple web pages backed by this repo might also be a better way to display pipeline content than integration with the Sphinx doc build. If we can get this stood up quickly (and link to it from the regular docs, of course), we may not want to bother with the doc integration from the last section at all.
There may also be fun things one could do with connections to SQuaSH (any metric we upload should probably come from a pipeline that’s exactly represented in this changelog repo) or build-system experimentation (a submodule view of the packages seems like a nice thing to hang a monolithic, eups-free CMake build on…), but now I’m getting ahead of myself.
Anyhow, I’m not quite sure who to talk to about getting something like this up and running, but I bet somebody who reads this will have some ideas. I should also add that my main concern with this idea is that it may partially duplicate a lot of things that already exist that I don’t know much about (e.g. the lsstsw versiondb repo, the new schema browser). Please chime in if you see this as conflicting with existing ways of doing things.