Connection aborted

I’m getting “Connection aborted” intermittently, but frequently on DP0 database queries in Python notebooks. At first, I thought that it had something to do with the CPU selection (small, medium, large) but that hypothesis is not holding up very well.

At this point, I’m just submitting my queries repeatedly until I get a complete response.

Any suggestions?

Hi Suber, thanks for posting. Could you reply with an example ADQL query that fails, the size of the container (and confirm you’re using the recommended image, Weekly 2024_42), and the exact error message you get? That would help a lot for us to reproduce and diagnose the issue.

My setup.

If you order the images as 1, 2, and 3, they should appear in the order 3, 1, 2.

This is a shot of the code. You can see that the job ran for almost 18 minutes and marked itself as complete. The next step is the fetch and that is the step that fails.

This next shot is just the tail end of the error message.

This shot is my previous run that is almost identical to the one that failed. You can see that I was
counting records only while the one that failed was just picking up the objectId field.

Thanks Suber. These error messages ring a bell for me, but typically I’ve found it to be transient, and that if I re-execute the code cell fetching the results, it will connect and complete. Do you find the same, or is the error persisting after multiple attempts to fetch the results?

As I side note, I notice there are no spatial constraints on these queries and so they’re taking 18 minutes. Spatial constraints on RA, Dec are generally recommended, especially when testing and developing code. But I don’t think that is related to the fetch issues here.

Thanks. Yes, the symptom is transient, but has been occuring frequently. My current activity is around a population study based upon magnitude and variability and not a particular location. We know that my search is for a sparse population and the time usage is fine (for me). If you wish, we can close the issue as resolved. I’ll let you know if it becomes are bigger problem.

@MelissaGraham I modified my code to carve up the area of review into 36 sectors. These sectors fit into the area defined by RA of 0° to 90° and DEC of -60° to -20°. In order to process all 36, I coded a Python loop as you can see in the code segment below. Of course, the code for the query was modified from the original which is earlier in this thread. I added a constraint on RA and DEC.

I was surprised that each cycle through the loop takes the same time as one cycle that covers the entire region - ~18 minutes. You can confirm this from the total run time of 10 hrs 53 min.

However, the issue of aborted connection failure occurred for 9 of the 36 runs through the loop. You can see from the output below that the query completes normally in all cases but when the fetch is attempted, the failure is that the connections were aborted. So, I’m assuming that this means that the connections were aborted after the database query completed.

For now, I will continue to test my process without a problem. This will only be an issue when we are operational.

The code block:

i=0
while i < len(sky_block[‘east_ra’]):
query = "SELECT COUNT(*) "
"FROM dp02_dc2_catalogs.Object "
"WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 "
"AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 "
"AND g_extendedness = 0.0 "
"AND r_extendedness = 0.0 "
"AND i_extendedness = 0.0 "
"AND coord_ra > {} "
"AND coord_ra < {} "
"AND coord_dec > {} "
"AND coord_dec < {} "
“AND detect_isPrimary = 1”.format(sky_block[‘east_ra’][i],sky_block[‘west_ra’][i],
sky_block[‘south_dec’][i],sky_block[‘north_dec’][i])
print(query)
job = service.submit_job(query)
job.run()
job.wait(phases=[‘COMPLETED’, ‘ERROR’])
print(‘Job phase is’, job.phase)
try:
ut4 = job.fetch_result().to_table()
print("Results in sector “, i,” = ",ut4)
except:
print(“No results in sector “, i,”. Step failed.”)
i+=1

The run stats:

Last executed at 2024-11-03 06:58:30 in 10h 53m 30s

The first 6 cycles of the output, which includes cycle 5 that failed:

SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 80.0 AND coord_ra < 90.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED
Results in sector 0 = COUNT

0

SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 70.0 AND coord_ra < 80.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED
Results in sector 1 = COUNT

11713
SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 60.0 AND coord_ra < 70.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED
Results in sector 2 = COUNT

37536
SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 50.0 AND coord_ra < 60.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED
Results in sector 3 = COUNT

30442
SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 40.0 AND coord_ra < 50.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED
Results in sector 4 = COUNT

0

SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 30.0 AND coord_ra < 40.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED

No results in sector 5 . Step failed.

SELECT COUNT(*) FROM dp02_dc2_catalogs.Object WHERE scisql_nanojanskyToAbMag(g_psfFlux) > 17 AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 AND g_extendedness = 0.0 AND r_extendedness = 0.0 AND i_extendedness = 0.0 AND coord_ra > 20.0 AND coord_ra < 30.0 AND coord_dec > -30.0 AND coord_dec < -20.0 AND detect_isPrimary = 1
Job phase is COMPLETED
Results in sector 6 = COUNT

0

Thanks for this Suber.

I know this is not obvious (and not ideal) but: when spatial constraints are included in an ADQL query as max/min values on the RA and Dec columns, the processing time might not go down. I think the system interprets the request in the same way as other column constraints, and reviews all table rows.

I see now that in under “General Advice” on the DP0.2 ADQL Queries page, it just says “It is recommended to use either an ADQL Cone Search or a Polygon Search, and to not use a WHERE … BETWEEN statement to set boundaries on RA and Dec.” I’ll make an edit there to better explain it’s not just recommended, but necessary for efficient queries.

The other factor affecting your results here is that the DP0.2 area only includes RA of about 50 to 75 deg, and Dec of about -27 to -45 deg. I think that’s why a few of your cycles returned no results, they were just out of bounds. The best figure for a quick reference of the DP0.2 area is Figure 15 from The LSST DESC DC2 Simulated Sky Survey, which I’ll paste below but is also on the DP0.2 Data Products Definitions page.

Below I put together a sample query that I think is similar to what you’re after. It uses the POLYGON constraint to define a 5x5 deg region of sky, applies your other column constraints, and returns the COUNT.

from lsst.rsp import get_tap_service, retrieve_query
service = get_tap_service("tap")

query = "SELECT COUNT(coord_ra) "\
        "FROM dp02_dc2_catalogs.Object "\
        "WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec), "\
        "POLYGON('ICRS', 65.0, -35.0, 65.0, -40.0, 70.0, -40.0, 70.0, -35.0))=1 "\
        "AND scisql_nanojanskyToAbMag(g_psfFlux) > 17 "\
        "AND scisql_nanojanskyToAbMag(g_psfFlux) < 21 "\
        "AND g_extendedness = 0.0 "\
        "AND r_extendedness = 0.0 "\
        "AND i_extendedness = 0.0 "\
        "AND detect_isPrimary = 1 "
print(query)

job = service.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)

results = job.fetch_result().to_table()
results

The above took about 1.5 minutes to run and returned a count of 29592.

I think that the above is going to set you up for more success, so I’m going to mark this reply post as the “solution” for this topic thread, but: as always, please do start up a new topic with any future questions or issues!

1 Like