Requesting feedback re. changes to the pipetask command line interface

I am working on migrating the existing pipetask command to use the same command line framework as the recently-introduced butler command, and would appreciate feedback on an interface change described below:

We are considering using a feature that allows pipetask subcommands to be “chained” together. This allows the products of one subcommand to be passed to the next subcommand when they are called in the same command execution. For example, in the following command, the build subcommand creates a pipeline instance which is passed to the qgraph subcommand, which then uses that pipeline when building a quantum graph (and then saves the quantum graph to a file called qgraph_file).

pipetask build -p ci_hsc_gen3/pipelines/CiHsc.yaml \
    qgraph -d "patch = 69" \
           -b ci_hsc_gen3/DATA/butler.yaml \
           --input $INPUTCOLL \
           -o $COLLECTION \
           --save-qgraph qgraph_file

This differs from the existing pipetask command, where the qgraph subcommand accepts all the options for the build subcommand and implicitly executes build before qgraph. (Similarly in the existing pipetask command, the run subcommand accepts options for the build and qgraph subcommands, and executes build and qgraph before run.) The same call as above with the existing interface would be:

pipetask qgraph -p ci_hsc_gen3/pipelines/CiHsc.yaml \
    -d "patch = 69" \
    -b ci_hsc_gen3/DATA/butler.yaml \
    --input "$INPUTCOLL" \
    -o "$COLLECTION" \
    --save-qgraph "$QGRAPH_FILE"

While the interface difference in this example is small, with chained subcommands calling one subcommand does not imply execution of any other subcommand. This approach:

  • Separates concerns of each subcommand.
  • Allows the possibility of alternative subcommand implementations.
  • Allows other subcommands to be created, and they can be called between the existing subcommands if appropriate.

However, there are some tradeoffs:

  • The interface is somewhat more verbose and specific:
    If you want to build a quantum graph from a new pipeline, you must call both subcommands with each subcommand’s options after it: pipetask build <build options> qgraph <qgraph options>, whereas before only one subcommand had to be called and all the options could be passed after it: pipetask qgraph <build options and qgraph options>.

  • You can’t see help for all subcommands at once:
    Help for each of the subcommands is viewed separately; pipetask build -h, pipetask qgraph -h, pipetask run -h for help on each of the three existing subcommands. With the existing pipetask implementation, pipetask run -h would show options for build, qgraph, and run.

  • Some options are repeated:
    In some cases, two or more subcommands have options in common. For example, qgraph and run both accept options for configuring a butler. At best it’s annoying to have to enter the same option values for two subcommands (pipetask build -p pipeline.yaml qgraph <butler options> run <same butler options>).

    I implemented a possible solution called “option forwarding”, where some of a subcommand’s options may “forward” to the next subcommand. Options that forward are indicated in a subcommand’s help output by the text “(f)” after the option’s metavariable information. A forwarded option may be overridden by passing that option to the next subcommand on the command line.
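As a rough sketch of the forwarding rule (hypothetical code, not the actual implementation): a forwarded option’s value becomes the default for the next subcommand, and an explicit value on the later subcommand wins:

```python
def forward_options(previous_values, forwarded_names, current_values):
    """Merge forwarded option values into the next subcommand's values.

    Hypothetical sketch: options named in ``forwarded_names`` carry their
    values over from the previous subcommand, unless the user passed an
    explicit (non-None) value to the current subcommand.
    """
    merged = {name: previous_values[name]
              for name in forwarded_names if name in previous_values}
    # an option given explicitly on the current subcommand overrides the
    # forwarded value
    merged.update({k: v for k, v in current_values.items() if v is not None})
    return merged
```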

Both versions of the pipetask command are available for testing and experimenting. The new interface is implemented in a command called pipetask2, and the original pipetask command is not changed. Changes were checked in this morning, so until the next weekly you will need at least today’s versions of daf_butler, obs_base, and ctrl_mpexec.

I would like to know:

  • If you have a preference for one interface or the other.
  • If you have questions or concerns about the chained subcommand interface or implementation.
  • If you have ideas for improvement or alternative approaches.
  • (anything else you’d like to say about it)

Thanks!

At first glance, it would seem more Unix-like and flexible to have each chained subcommand be part of an actual Unix pipeline (pipetask build | pipetask qgraph | pipetask run), but I guess it may not be so easy/efficient to express the forwarded state as a stdout stream.

I can’t say I find the new system very easy to use: I have to look up which options belong to which stage of pipetask, a distinction that’s fairly arbitrary to a user who has some data and wants to run a pipeline on it. A complete novice might not even know what order the steps need to be in.

I understand that this helps large-scale and advanced users, but I thought they would have different runners, and that pipetask was intended for light use?

I’ll grant that the new command line doesn’t seem to be any longer than the old one, however. That was a pleasant surprise. :slightly_smiling_face:

I also have some feedback on specific options:

  • --butler-config now requires a .yaml file, and not a repository URI. This is confusing, since AFAIK most users think of Butler repositories as a location, not a config file.
  • The default logging level has been set to WARN. This goes against both general logging convention and the behavior used by all other Science Pipelines programs, which default to INFO. I actually thought the runner had locked up before I figured out what was going on.
  • Can --log-level be given repeat support, since it’s shared by all three steps?

That definitely has to be fixed: any repository URI you can give to the Butler() constructor must be acceptable here, including S3 and other URI schemes.

Using a Unix pipeline is an interesting idea. I think all the products would have to be serialized to pass through stdout/in, right? I wonder how that would interact with logging output and any other printing that may be happening (correctly or not).

Any state passed between subcommands would need to be serialized to stdout, or stored elsewhere with a pointer in stdout. Logging would have to go to stderr, and print would have to specify file=sys.stderr.
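A minimal sketch of what that split would look like, assuming the product is JSON-serializable (which a real pipeline or quantum graph may well not be):

```python
import io
import json
import logging
import sys

# Sketch: in a Unix-pipeline design, only the serialized product may travel
# on stdout; logs and any other printing must go to stderr.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)

def read_state(stream=None):
    # the upstream stage's product arrives on stdin (empty for the first stage)
    data = (stream or sys.stdin).read()
    return json.loads(data) if data.strip() else {}

def write_state(state, stream=None):
    # keep stdout clean: nothing but the serialized state goes here
    json.dump(state, stream or sys.stdout)
```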

If that intermediate serialized state can be reused in other contexts, this could be a good thing, but otherwise it’s likely more complexity than desirable for a “light use” activator, especially if the amount of reuse is minimal and/or the sequence of tasks is almost always the same.

I agree. Sounds like a bit of a nightmare. It also doesn’t really solve the user interface discussion we are having, since it still results in duplicating command line options (unless you also tried to embed those options in the serialized output from the previous command).

FWIW, purely from a user perspective, having the yaml file list the tasks and their order, with their internal inputs and outputs, sounds like a more straightforward thing to figure out how to use. It also has the advantage of saving the commands you are running in the code itself. Joe Zuntz did something similar here:

@kfindeisen, thanks for the feedback.

I created https://jira.lsstcorp.org/browse/DM-26388 to track the --butler-config issue.

I see where the existing pipetask sets the root logger to INFO. I’ll do the same in pipetask2; I updated DM-26103 to track this.

If I understand your --log-level request, I think it’s already there; you can pass it after the command pipetask2 or after any of the subcommands. It accepts a level for the default logger and key=value pairs for module-level logging, e.g. lsst.daf.butler=DEBUG. The ... after the metavar is supposed to hint at that, but I could make the help text more explicit as well. Or am I misunderstanding your request?

I thought (f) was supposed to mark such options, and I didn’t see it in the help text. I’ll admit I didn’t actually try it.

Aha. (f) indicates an option that forwards to the next subcommand. There’s a description of that above and at the bottom of the help text.

So, does --log-level already support it?

  • Can --log-level be given repeat support, since it’s shared by all three steps?

and

So, does --log-level already support it?

I guess I’m confused if “repeat support” means “accepts multiple” or “forwards”.

--log-level does not forward, and it does not need to; because of the way the logging module works, settings persist for the duration of a python program’s execution and will not change from subcommand to subcommand. (This happens everywhere in python; you can see it in the batch execution of unit tests too.) So if you say pipetask2 --log-level INFO,lsst.daf.butler=DEBUG build qgraph run, those log level settings will be in effect for all three of the subcommands while they execute.
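To illustrate why forwarding is unnecessary here, a sketch of applying a --log-level style value (the comma-separated format is inferred from the examples in this thread; this is not the actual pipetask code). Once applied, the levels are process-wide state and stay in effect for every subsequent subcommand:

```python
import logging

def set_log_levels(spec):
    """Apply a --log-level style spec such as "INFO,lsst.daf.butler=DEBUG".

    Sketch only: a bare level sets the root logger; name=LEVEL entries set
    component loggers. Because the logging module holds process-wide state,
    these settings persist across all subcommands in one invocation.
    """
    for part in spec.split(","):
        if "=" in part:
            name, level = part.split("=", 1)
            logging.getLogger(name).setLevel(getattr(logging, level))
        else:
            logging.getLogger().setLevel(getattr(logging, part))
```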

Based on feedback and discussions in the middleware team, it seems that changing pipetask build, qgraph, and run into chainable subcommands is not the level of granularity we want, and we want to keep the existing pipetask interface.

We are still migrating the pipetask command to the Click-based command line framework. The pipetask2 interface is now very similar to pipetask in the current daf_butler and ctrl_mpexec (not yet in the weekly). pipetask2 will eventually be renamed to pipetask, replacing the existing pipetask command line tool.

Option names/flags may change slightly as we standardize option names and definitions in our command line tooling. The only current difference in pipetask2 is that -i is no longer an abbreviation for --input, because it abbreviates --instrument in butler.

If you have an opportunity, please test the pipetask2 command with your workflow. As always, we appreciate issue reports and any feedback about the interface.

pipetask2 as described above (on August 25) is now in the current weekly. Before the next weekly build we would like to rename it to pipetask and remove the existing pipetask implementation. The change should not be noticeable, except for the -i change noted above.

If you rely on pipetask this would be a good time to verify that the current pipetask2 works for your use cases. Please let me know if you have any issues.

Tagging @KSK, @hsinfang, and @kfindeisen specifically to be sure they get this message.

There’s one other change to the UI: --configfile has been replaced by --config-file. No other differences in my case.