Time for a query

I have noticed that there is a large range in the time an ADQL query can take. At 8am US Central time the query takes 2 seconds, but later in the day it takes many minutes or sometimes doesn't finish at all. I see from other posts that this is a known issue. I am guessing this is related to the number of users and jobs currently using the RSP. Is there any way for a user to know how many people are using the platform at a given time, so as to estimate whether now is a good time to run a query? Below is the query I ran:

SELECT fs.coord_ra, fs.coord_dec, ccd.visitId, ccd.detector, fs.forcedSourceId, fs.objectId, fs.ccdVisitId, fs.detect_isPrimary, 
fs.band, scisql_nanojanskyToAbMag(fs.psfFlux) as psfMag, ccd.obsStartMJD, scisql_nanojanskyToAbMag(obj.r_psfFlux) as obj_rpsfMag, obj.r_extendedness
FROM dp02_dc2_catalogs.ForcedSource as fs 
JOIN dp02_dc2_catalogs.CcdVisit as ccd ON fs.ccdVisitId = ccd.ccdVisitId 
JOIN dp02_dc2_catalogs.Object as obj ON fs.objectId = obj.objectId WHERE obj.coord_ra 
BETWEEN 59.58 AND 59.96 AND obj.coord_dec BETWEEN -36.95 AND -36.65 AND obj.detect_isPrimary = 1 AND fs.band = 'r' AND ccd.visitId = 1185205 
AND ccd.detector = 82

Hi Matt,

Thanks for the observation and question. We (the RSP database team) have absolutely seen this issue, and it is a principal focus of our ongoing development work to improve this.

As a clarification, we have seen that these slow-downs are currently triggered not so much by cumulative user load as by a small number of concurrent queries that Qserv (the RSP's back-end distributed database) has to work particularly hard to answer (more specifics below). When faced with these, Qserv can consume excessive cluster resources and "starve itself" of the ability to process other queries (simple or complex) concurrently.

We don’t expect our users to have to know or guess these particulars about Qserv, and we know the great variability in response times currently seen on the system is frustrating. We are working to implement resource management mechanisms in Qserv to even this out and hope to be rolling out some significant improvements in the upcoming weeks.

Your question re. monitoring is also well taken. While we do have some internal engineering interfaces that we use to keep tabs on this, these cannot practically be exposed to end users. We have discussed offering a generalized "load average" metric to users, as a relative indication of how hard the cluster is currently working. That said, we still feel the best approach is simply to keep improving Qserv to the point that our users don't have to care!

For those interested, some further details on what sorts of queries currently make Qserv work particularly hard (and tips on optimizing this query in particular for Qserv) will follow shortly on this thread. Our CST team also points out their page ADQL Recipes — Vera C. Rubin Observatory Documentation for Data Preview 0.2, which offers some general advice for crafting low-impact queries.

Thanks much,
–FritzM.


@mwiesner, one thing we can offer right away is that, with a small adjustment, the above query can be processed with an order of magnitude greater efficiency by Qserv. The tweak is to use ADQL CONTAINS(POINT..., POLYGON...) for the area constraint, instead of the obj.coord_ra BETWEEN ... AND obj.coord_dec BETWEEN ... construction:

SELECT 
    fs.coord_ra, fs.coord_dec, ccd.visitId, ccd.detector, fs.forcedSourceId, 
    fs.objectId, fs.ccdVisitId, fs.detect_isPrimary, fs.band,
    scisql_nanojanskyToAbMag(fs.psfFlux) as psfMag, ccd.obsStartMJD, 
    scisql_nanojanskyToAbMag(obj.r_psfFlux) as obj_rpsfMag, obj.r_extendedness
FROM
    dp02_dc2_catalogs.ForcedSource as fs
    JOIN dp02_dc2_catalogs.CcdVisit as ccd ON fs.ccdVisitId = ccd.ccdVisitId 
    JOIN dp02_dc2_catalogs.Object as obj ON fs.objectId = obj.objectId
WHERE 
    CONTAINS(POINT('ICRS', obj.coord_ra, obj.coord_dec), 
        POLYGON('ICRS', 59.58, -36.95, 59.58, -36.65, 59.96, -36.65, 59.96, -36.95))=1
    AND obj.detect_isPrimary = 1
    AND fs.band = 'r'
    AND ccd.visitId = 1185205 
    AND ccd.detector = 82

The reason this helps is that Qserv is able to recognize the CONTAINS(POINT..., POLYGON...) construction and direct the query to just the involved sections of sky. Qserv cannot currently make the same inference from the BETWEEN construction, so it gives up and runs the query over the entire sky instead. This, in combination with the fact that the query contains a join on the weighty ForcedSource table, is enough to trigger the resource-management pathology in Qserv mentioned in the post at the head of this thread.

We recognize this is a Qserv problem and hope to improve/extend the spatial-restriction recognition code soon! In the meantime, expressing spatial restrictions with ADQL CONTAINS(POINT..., POLYGON...) or CONTAINS(POINT..., CIRCLE...) may help you get your queries through the system much more quickly and with much less impact.
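For readers who prefer the CIRCLE form, here is a sketch of the same kind of spatial constraint. Note the center coordinates and radius below are illustrative assumptions chosen to roughly cover the same region as the POLYGON above; they are not taken from the original query, so adjust them to your own field:

```sql
-- Sketch only: a CIRCLE-based spatial restriction that Qserv can
-- also recognize and use to limit the query to the relevant sky area.
-- Center (59.77, -36.80) and radius 0.25 deg are assumed values.
SELECT
    obj.objectId, obj.coord_ra, obj.coord_dec,
    scisql_nanojanskyToAbMag(obj.r_psfFlux) AS obj_rpsfMag
FROM
    dp02_dc2_catalogs.Object AS obj
WHERE
    CONTAINS(POINT('ICRS', obj.coord_ra, obj.coord_dec),
        CIRCLE('ICRS', 59.77, -36.80, 0.25)) = 1
    AND obj.detect_isPrimary = 1
```

As with the POLYGON version, the key point is that the CONTAINS(POINT..., CIRCLE...) construction lets Qserv restrict the work to the chunks of sky actually covered by the circle, rather than scanning the whole catalog.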


Fritz, thank you so much for this detailed response! This is very helpful.
