Intermittent Qserv (TAP) query service issue - 2021-07-13

Hi folks,

Several of our DP0.1 delegates have now noticed and/or reported issues with the TAP query service, along the lines
of “queries were working fine for me for a while, but then they suddenly seemed to stop returning and/or
started returning server errors…” This is an informational note to let folks know that we are in fact
currently experiencing an issue with the Qserv database server that backs the TAP service. Any issues that match the above description are most probably related to this same root cause.

We are currently seeing the problem occur intermittently, about once or twice every 24 hours. The Qserv
development team is pursuing a fix, but until a patch can be developed, tested, and deployed, an
operator must monitor the server and manually intervene to “unstick” it whenever
the bug is triggered, after which normal query processing resumes.

Though we are actively monitoring the server at this time, if you find that some of your queries appear to be stuck, please comment on GitHub issue #14 in rubin-dp0/Support, “[BUG] Intermittent lock-up of Qserv query processing”, for immediate attention.

Thank you very much for your patience as we hammer out a fix! Some further technical detail and ongoing
status will be threaded here below for those who may be interested.

We believe this to be fixed with the deployment of Qserv release 2021.7.1-rc1 this past Friday. The root cause was a resource deadlock in the Qserv query distribution and result-accumulation code. Thank you for your patience while we tracked this one down!