Starting this topic to collate information from Nate’s presentation on Pipelines during today’s Science Pipelines group discussion.
There were many questions about how we document Configs, how we document Pipelines, and how we document changes to, defaults for, and overrides to those Configs. If the people involved in today’s discussion could please weigh in (I was not able to take notes during the conversation), it would be very useful to have some of this down in writing. A good start might be @natelust’s list of questions at the end of his presentation.
I will note that nothing in the DM dev guide talks about Config-specific documentation (besides the Config doctype). I think a page summarizing all of this (once we settle on processes) would be very useful.
I think requiring that every config value be explained with a comment is a reasonable start; we can do this in both Python and YAML files. It’s just a matter of getting the practice into the “company culture” so that it gets enforced in review.
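To make "enforced in review" a bit more concrete, here is a minimal sketch of a check that could flag undocumented values in a Python config override file. It assumes overrides are written as assignments to attributes of a variable named `config`, and it treats a comment on the same line or on the line directly above as documentation. The function name `undocumented_overrides` and the config field names in the example are hypothetical, not existing tooling or real fields.

```python
import ast
import io
import tokenize


def undocumented_overrides(source: str) -> list[int]:
    """Return line numbers of `config.<...> = value` assignments that have
    no comment on the same line(s) or on the line directly above."""
    # Collect the line number of every comment token in the file.
    comment_lines = {
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    }
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        target = node.targets[0]
        if not isinstance(target, (ast.Attribute, ast.Subscript)):
            continue  # plain names such as `config = ...` are not overrides
        # Follow the chain (config.a.b or config.a["b"]) down to its root.
        while isinstance(target, (ast.Attribute, ast.Subscript)):
            target = target.value
        if not (isinstance(target, ast.Name) and target.id == "config"):
            continue
        # Lines that may carry the explanation: the one above through the last.
        covered = range(node.lineno - 1, node.end_lineno + 1)
        if not any(line in comment_lines for line in covered):
            missing.append(node.lineno)
    return sorted(missing)


# Hypothetical override file; the field names are made up for illustration.
EXAMPLE = '''\
# Value chosen to minimize false positive risk.
config.detection.thresholdValue = 5.0
config.doWrite = False
config.measurement.doReplaceWithNoise = True  # Needed for deblended sources.
'''

print(undocumented_overrides(EXAMPLE))
```

A known limitation of this sketch: a trailing comment on one assignment also counts as documentation for the assignment on the next line, so it is a nudge for reviewers rather than a guarantee.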
I’m less sure that we need a special process for documenting changes to values. Default changes are, in general, breaking changes, so they should be RFC’d, put in release notes, etc. per existing policy. It wasn’t clear to me from the discussion what that misses.
I think changes to configs (including adding an override at a level where none existed before) should be described in git commit messages, rather than in code comments or other forms that are associated with the state of the configuration rather than a diff. Using git blame to find that kind of information is often not a great experience, but I suspect it’s a better experience (and much less maintenance) than what we’d get trying to do that kind of investigation with bespoke tooling.
Hopefully those commits can usually reference other, more detailed documentation of the change, as @kfindeisen noted (and despite the unfortunate Trac precedent mentioned in the live call, I think a Jira ticket is fine, and probably the default thing to reference - we should not be developing with the mindset that it’s our job to guard against all information in Jira someday becoming inaccessible).
I don’t think we need to be terribly defensive in our coding, either. My point with the Trac example was that we should try to make comments reasonably self-contained (e.g., “Value chosen to minimize false positive risk” as opposed to “See ticket#1234”). But that’s not an argument specific to configs.
I agree we should have at minimum an in-line YAML comment for each and every config value in the “official” Pipelines we distribute. I’d also find a more verbose description for each pipeline useful, e.g., there is one called “Forced” in pipe_tasks right now and I’m guessing it does forced photometry (that is based only on the fact that it includes two tasks with the name “ForcedPhot” in them).
Who would run this pipeline and why? What are the inputs and outputs? We are comfortable with all our various processing workflows, and can theoretically generate quantum graphs to our heart’s content (if we ask Jim or Nate to remind us how), but we shouldn’t assume this is common knowledge.
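As a rough illustration of the "in-line comment for every config value" idea, here is a sketch of a line-based check over a pipeline YAML file. It assumes the layout of a top-level `description:` and per-task `config:` blocks, uses only the standard library (no YAML parser), and naively treats any `#` on a line as a comment. The function name, the task classes, and the config keys in the example are all hypothetical.

```python
def uncommented_config_lines(yaml_text: str) -> list[int]:
    """Return line numbers of entries inside `config:` blocks that have no
    comment on the same line or on the line directly above."""
    missing = []
    config_indent = None  # indentation of the enclosing `config:` key, if any
    previous_was_comment = False
    for number, raw in enumerate(yaml_text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped:
            previous_was_comment = False
            continue
        if stripped.startswith("#"):
            previous_was_comment = True
            continue
        indent = len(raw) - len(raw.lstrip())
        if config_indent is not None and indent <= config_indent:
            config_indent = None  # dedented: we have left the config block
        if config_indent is not None:
            # Naive test: a '#' inside a quoted string would also count.
            if "#" not in stripped and not previous_was_comment:
                missing.append(number)
        elif stripped.split("#")[0].strip() == "config:":
            config_indent = indent
        previous_was_comment = False
    return missing


# Hypothetical pipeline; class paths and config keys are placeholders.
EXAMPLE_PIPELINE = """\
description: Hypothetical forced-photometry pipeline
tasks:
  forcedPhot:
    class: example.ForcedPhotTask
    config:
      # Reuse the reference footprints rather than re-detecting.
      useFootprints: true
      apertureRadius: 12.0
      doWrite: false  # Outputs are not needed for this example.
  otherTask:
    class: example.OtherTask
"""

print(uncommented_config_lines(EXAMPLE_PIPELINE))
```

This only checks that a comment exists, not that it says anything useful; the "who would run this and why" questions would still need a human-written description.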
I’m also wondering where it makes sense for Pipelines YAML files to land in the first place. Right now I know there are a bunch in pipe_tasks, and one in ap_pipe, and others … presumably in other random packages. It would be helpful to have an (auto-generated?) index of them all somewhere, perhaps addressing why they live where they do and delineating the config choices.
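For what an auto-generated index might look like, here is a minimal sketch. It assumes (and this is only an assumption) that each package keeps its pipeline files under a `pipelines/` directory and that each file has a single-line top-level `description:` entry; the function name `index_pipelines` is hypothetical.

```python
from pathlib import Path


def index_pipelines(root: Path) -> dict[str, str]:
    """Map each `<package>/pipelines/**/*.yaml` file under `root` to the
    text of its top-level `description:` line, if it has one."""
    index = {}
    for path in sorted(root.glob("*/pipelines/**/*.yaml")):
        description = "(no description)"
        for line in path.read_text().splitlines():
            # Only an unindented key counts as the top-level description.
            if line.startswith("description:"):
                description = line.partition(":")[2].strip() or description
                break
        index[path.relative_to(root).as_posix()] = description
    return index


if __name__ == "__main__":
    # Index whatever packages sit under the current directory.
    for name, description in index_pipelines(Path(".")).items():
        print(f"{name}: {description}")
```

Multi-line descriptions (for example, YAML block scalars) would need a real YAML parser, and the sketch says nothing about why a pipeline lives where it does; that part would still have to be written by hand.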
I’m hoping at least some of this will be addressed in a future Gen 3 “getting started guide,” which would be a natural place to demonstrate one of our pre-built pipelines as well as a user-generated variant.