On the UK-DAC I’ve noticed that `select count(*) from dp02_dc2_catalogs.Object` queries are slow via our portal but fast via our notebook. On the US RSP they are fast both ways.
We have row-count statistics on our Qserv tables, so I’m assuming the difference is that the portal adds a `LIMIT` clause. So either our Qserv or our RSP is doing something different in parsing/removing this.
We’re a few releases behind on Qserv but our RSP is up to date, so I’m leaning towards the difference being in Qserv, but wondered if anyone knows?
I had a search of various GitHub repos but couldn’t readily find anything.
Thanks
Mike
What version of Qserv are you running?
Yes, `LIMIT <N>` is what’s ruining the optimization. I will open a JIRA ticket to address this issue. The result of the query shouldn’t be affected by the presence of `LIMIT <N>`.
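To make that concrete (table name taken from the query above; the exact `LIMIT` value the portal adds is just a placeholder here):

```sql
-- Answered from the precomputed row-counter statistics: sub-second
SELECT COUNT(*) FROM dp02_dc2_catalogs.Object;

-- Same single-row result, but the LIMIT clause currently disables the
-- optimization, so Qserv falls back to counting rows chunk by chunk
SELECT COUNT(*) FROM dp02_dc2_catalogs.Object LIMIT 1000;
```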
It’s not a version-specific problem. It’s in all versions of Qserv since the optimization was introduced ~3 years ago.
Think we’re on 2024.5.1-rc7 on our prod cluster.
Thanks both.
Though I don’t understand why the US-RSP portal query is quick? That has the `LIMIT` clause in it, AFAICT.
I’m curious, how much faster are queries submitted via the US-RSP portal compared to yours?
I have registered the following JIRA ticket to address the issue: https://rubinobs.atlassian.net/browse/DM-51585
On the US-RSP the portal count query returns in around 5 seconds, which seemed like it’s using the precalculated stats. If I add a `WHERE` clause it takes a bit longer, ~30s? Maybe it’s just that your Qserv is that much quicker.
On ours it takes tens of minutes and I can see it querying each chunk, i.e. actually doing a count.
The optimized queries should return in less than a second. 5 seconds means a scan, which is very fast if no `WHERE` is present. 30 seconds is reasonable performance if some lightweight `WHERE` is present. Qserv needs to be fixed to address this problem.
Hi @gapon , going through the RSP though, I presume there are a few seconds of RSP/TAP overhead and the count query itself is only taking a couple of seconds at the Qserv end? If that is doing a scan, counting rows in each chunk/worker and aggregating the results, that’s very impressive! And a thousand times or more quicker than our (HDD prod) system. Hopefully when we get our new NVMe-based prod system up and running we’ll get closer to this performance.
As a further test I did a US-RSP portal simple count query on ForcedSource, so the `LIMIT` clause is present. This returns the expected 111 billion row count in just a few seconds; how is this so quick? I was expecting it to take a few hundred seconds. I think I’m missing something still; maybe the newer Qserv is more optimized somehow for full simple counts without using the stats?
No “magic” here. The unconstrained row counting in the actual query execution mode (if no row-counter statistics are available to the Qserv Czar) is quick because we rely on the MyISAM table engine. The engine quickly reports the number of rows in each chunk table. Counters from all chunks are collected in parallel by all workers using available threads. On top of that, there is an obvious Qserv overhead for parsing/analyzing the query, sending thousands of requests to workers, aggregating counters into the result table, etc. It all takes one or two seconds, or less.
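To illustrate the engine behaviour on a plain MySQL/MariaDB server, outside Qserv (the chunk-table name here is hypothetical):

```sql
-- MyISAM keeps an exact row count in the table's metadata, so this
-- returns without reading any rows:
SELECT COUNT(*) FROM Object_57866;

-- The same count appears in the Rows column, which is exact for MyISAM:
SHOW TABLE STATUS LIKE 'Object_57866';
```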
If we were to use a different storage engine, such as `InnoDB`, then things would look very different. Simple `COUNT(*)` queries against `InnoDB` tables require a real scan of all entries in the table’s B-Tree. Such an operation would take many seconds for the large chunk tables.
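One way to see the difference for yourself is via `EXPLAIN` (hypothetical table names; the output is paraphrased in the comments):

```sql
-- MyISAM: no table access at all; EXPLAIN reports
-- "Select tables optimized away"
EXPLAIN SELECT COUNT(*) FROM Object_myisam;

-- InnoDB: a full index scan; EXPLAIN reports type: index with a row
-- estimate covering the whole table
EXPLAIN SELECT COUNT(*) FROM Object_innodb;
```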
However, when you do a real scan that involves a constraint, MySQL would have to read the whole table (or part of it, depending on the query and the availability of indexes):
`SELECT ... FROM ... WHERE something = ...`
Such a query would take a while (maybe hours) on a table of 111 billion rows.
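As a rough, purely illustrative back-of-envelope: at ~100 bytes per row, 111 billion rows is on the order of 10 TB; even at an aggregate scan rate of ~10 GB/s across all workers that is still ~20 minutes, and a cluster delivering an order of magnitude less would indeed be in the hours.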
Thanks @gapon , so simple count queries, even without row-counter stats, should be quick because the MyISAM engine isn’t having to actually read the table rows? Yet on our Qserv it takes an hour on the Object table with the `LIMIT` clause in place to avoid using the counters; it feels like we’re missing something in our set-up that’s not just disk I/O.
As a test I submitted count queries to all Object chunks on all workers directly, bypassing the Czar. That took only a second or two to aggregate all the results, and they weren’t even run in parallel.
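For anyone wanting to reproduce this, the per-chunk query was along these lines (chunk id hypothetical, and assuming Qserv’s usual `<table>_<chunkId>` naming for worker-side chunk tables):

```sql
-- Run against each worker's mysqld in turn; with MyISAM each count is
-- answered from table metadata, hence the near-instant responses
SELECT COUNT(*) FROM dp02_dc2_catalogs.Object_57866;
```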
I’ll do some more tests when back from leave to try and understand what’s going on.
Cheers
Mike
Hi Mike! Yes, the simple row-counting query should be quick in MyISAM. Most likely, you’ve encountered a known issue with the older Qserv design, where the entire chunk table is preloaded into the memory of the worker node before Qserv runs the scans against the chunk. And this is done for each chunk, regardless of its size (or table type). The mechanism was found to significantly affect the performance of most scan queries (except for a few corner cases). After identifying this issue, I eliminated the mechanism from Qserv in the following release: https://github.com/lsst/qserv/releases/tag/2025.5.1-rc1.
Thanks @gapon , that’s put my mind at rest and I can enjoy my holiday now! I’d read something about the changes to memory loading but hadn’t appreciated how they would affect certain things. Our new cluster will have an up-to-date Qserv, so I’ll be able to confirm this when it’s rolled out.