SQUASH has a new feature to monitor code changes in the LSST software stack.
The plot of metric measurements vs. time now has an associated table listing the CI jobs and the packages that changed relative to the previous job. The plot and the table are linked, so clicking a data point auto-scrolls to the corresponding job in the table, and vice versa.
CI jobs are linked to Jenkins for more information on each execution, and package names are linked to the corresponding commit on GitHub so the code changes are easy to check.
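To illustrate what the table shows for each job, here is a minimal sketch of comparing the package versions of two consecutive CI jobs. The dict layout (package name mapped to commit SHA), the sample SHAs, and the function name `changed_packages` are assumptions made for this example, not SQUASH's actual data model or API.

```python
def changed_packages(previous, current):
    """Return {package: (old_sha, new_sha)} for packages that differ.

    `previous` and `current` are hypothetical dicts mapping a package
    name to the git commit SHA built in that CI job. A package missing
    from one of the jobs is reported with None on that side.
    """
    changes = {}
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Made-up package versions for two consecutive jobs.
job_a = {"afw": "1a2b3c", "meas_base": "4d5e6f", "validate_drp": "777aaa"}
job_b = {"afw": "1a2b3c", "meas_base": "9f8e7d", "validate_drp": "777aaa"}

for package, (old, new) in changed_packages(job_a, job_b).items():
    print(f"{package}: {old} -> {new}")
```

Only packages whose commit differs between the two jobs are listed, which mirrors the idea of showing what changed with respect to the previous job.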
To evaluate whether an observed transition is significant, one can use the “wheel zoom” on the y-axis scale and visually compare the change with the thresholds for the selected metric.
New features are being planned for the next cycle and feedback is always very welcome!
Good work @afausti! Actually seeing Stack changes connected to performance changes is really awesome.
The case of job 333 is interesting, where a change (albeit not substantial in an absolute sense) was correlated with many changes in the code base. It seems that both Python 3 changes and algorithm changes fell by coincidence into job 333’s time window.
I think this suggests that we should go beyond running validate_drp on a regular interval by also:
- Running the validate_drp afterburner for every merge to master in the
- Running the validate_drp afterburner for every successful user-submitted Jenkins stack-os-matrix job on a ticket branch.
(or something like this)
Doing this would make it easier to trace performance changes to specific code merges, and even identify performance changes before they’re merged.
Nice work, @afausti
Breaking this down to show the code changes is really nice. And linking to the specific commits is brilliant.
This looks very useful. I’d caution against trying to improve the granularity by increasing the frequency with which we regularly run validate_drp, however. While that will work for now, I think SQUASH will really start to become useful when we connect it to processing runs on much larger datasets, and I don’t think it will ever be feasible to automatically run that processing very often. Being able to easily fill in the missing runs when needed (i.e. when a human has noticed a regression) might be a better approach, especially if we had a git-bisect-like way to minimize the number of extra runs needed to localize a regression to a single commit.
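The git-bisect-like idea can be sketched as a binary search over the commits between a known-good run and a known-bad run. This is only an illustration under stated assumptions: `first_regressed_commit` and `is_regressed` are hypothetical names, `is_regressed` stands in for an expensive processing run plus a metric check, and the search assumes that once the regression appears it persists in all later commits.

```python
def first_regressed_commit(commits, is_regressed):
    """Locate the first commit that shows a regression.

    `commits` is ordered oldest to newest; commits[0] is assumed known
    good and commits[-1] known bad. `is_regressed(commit)` is a
    hypothetical callable that triggers one extra run and reports
    whether the metric has regressed at that commit.

    Returns (commit, runs), where `runs` counts the extra runs needed.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    runs = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        runs += 1
        if is_regressed(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], runs


# Made-up history of 16 commits with the regression introduced at c11.
history = [f"c{i}" for i in range(16)]
culprit, extra_runs = first_regressed_commit(
    history, lambda c: int(c[1:]) >= 11
)
print(culprit, extra_runs)
```

With 16 commits between the good and bad runs, this needs 4 extra runs rather than one run per commit, which is the saving that matters when each run is a large processing job.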
Great point. Hopefully this will be clearer (to me at least) once @frossie, @afausti and I write the QC design document this cycle.
Thanks @jbosch - I suspect the right approach is a mixture of the two (automated runs on small gold master datasets for code regressions, shipping metrics from large ad-hoc runs for analysis), but as @jsick says we’re hoping to work out the use cases in December. We’ll come knocking.