Jenkins service status - Feb 2020

(frossie) #1

Hey folks, sadly we have had a rough week with Jenkins.

Earlier this week we had to disable the Mac builds when the switch on the build node rack went pfft (annoyingly in an ambiguous and hard to diagnose way). We have an emergency requisition order out for a new switch - as it is literally the only physical system DM has in Tucson, we are not very deep with spares. It will hopefully be here early next week.

Secondly, we have had a number of frustrating issues that are plausibly (but not provably) explained by resource starvation on the CentOS build nodes on the lsp-int kubernetes cluster at NCSA. We have shut down our development science platform on that cluster in an attempt to free up resources. Unfortunately we are not allowed to run on the larger lsp-stable cluster by NCSA Security, though our LDF friends are trying to add nodes to that little over-subscribed cluster. Builds are running now, so maybe that helped, but we won’t know for sure till tomorrow.

Thirdly, as you may have noticed, build times have lengthened considerably since December. I am trying to develop a plan to investigate and improve that, but it will take a while to bring together the right people to do it.

I am painfully aware of what a giant pain it is to science pipeline developers when the service is not available. These are often hard problems to diagnose because of the complexity of the build chain and our lack of administrative access to the infrastructure, as well as the fact that the best person to deal with problems is now allocated to IT South and busy with urgent summit work.

While we try to keep folks posted, it’s sometimes hard to do when knee-deep in problems. I will start pinning messages to #dm on Slack with updates at @swinbank’s request, but if you need a response please use the #dm-jenkins channel as it is very hard for us to keep up with #dm.

I am very sorry for the irritation and inconvenience this is undoubtedly causing.

(frossie) #2

Thanks to some first-class sleuthing by @ktl we now know that the conda conflict errors, at least, are not due to resource starvation, but to dirty state left behind after specifying a non-default conda env value in bin/deploy. For a further explanation consult:

(frossie) #3

We have resolved the main issue affecting the builds described above, and therefore we think we’re good.

Thanks to all who helped and sorry it took so long, it was a tricky one…