How to query SSSource to obtain one row per ssObjectId?

Hi all, I hope this message finds you well!

I have a question regarding how to perform a particular type of query on the Rubin Science Platform (using the ADQL query window), and specifically the SSSource catalog. I would like my query to return one (or some set number) of rows for each of the ~4.4 million Solar System objects in the catalog; e.g., one row for each ssObjectId (including multiple columns, such as heliocentricDist and the heliocentric velocities, among others). This seems like a straightforward query, but for some reason I am having trouble getting it to work, and I am not sure whether this is due to my admittedly very limited SQL experience, or due to the fact that not all SQL syntax is implemented in ADQL (or some combination of both). I have tried various ways of going about it, such as the following query:

SELECT sss.ssObjectId, min(sss.diaSourceId) as diaSourceId, sss.heliocentricDist
FROM dp03_catalogs_10yr.SSSource AS sss
group by sss.ssObjectId, sss.heliocentricDist

however this returns all entries for each ssObjectId (not just the one with the minimum diaSourceId). Searching around, I found another possible approach here: https://stackoverflow.com/questions/15291506/sql-query-to-select-distinct-row-with-minimum-value and tried the query:

SELECT sss.*
FROM dp03_catalogs_10yr.SSSource AS sss
  INNER JOIN
  (
    SELECT sss.ssObjectId, MIN(sss.diaSourceId) MinDiaSourceId
    FROM sss
    GROUP BY sss.ssObjectId
  ) tbl1
  ON tbl1.id = sss.ssObjectId
WHERE tbl1.MinDiaSourceId = sss.diaSourceId

however, this results in the error: “sub-select not supported in FROM clause.”

I apologize if this is a very basic question or if it has been answered elsewhere on the forum, I have searched around through the entries and have not come across the same question, hopefully I did not miss something. Thank you so much for your time and for any suggestions you may have!

Hi Ellie, thanks for posting this question.

I think the reason why the query was not returning only one row per SSObject was because it asked for all heliocentric distances from the SSSource table for each SSObject. To return them, the results must have more than one row per SSObject: they must have one row per SSSource.

The query below will return one row per SSObject, with the minimum value of the diaSourceId column from the SSSource table. I’ve added “TOP 100” to the “SELECT” statement so that it runs quickly, as a demonstration.

SELECT TOP 100 sso.ssObjectId, MIN(sss.diaSourceId) AS min_diaSourceId
FROM dp03_catalogs_10yr.SSObject AS sso
JOIN dp03_catalogs_10yr.SSSource AS sss ON sso.ssObjectId = sss.ssObjectId
GROUP BY sso.ssObjectId
ORDER BY sso.ssObjectId ASC

Based on the explanation this is not quite what is wanted, but rather a query that returns columns of the SSSource table, but only for rows for which the diaSourceId is its minimum value for a given SSObject.

It is true that subqueries are not supported for DP0.3: the recommended workaround is to use “WHERE … IN …” ADQL statements (see the DP0.3 Table Access and Queries page of the documentation).

The following code snippet, when executed in a Jupyter Notebook in the Notebook Aspect of the Rubin Science Platform at data.lsst.cloud, will yield a results2 table that contains one row per SSObject. The columns will be the ssObjectId, the minimum diaSourceId, and the heliocentricDist for the SSSource record associated with the minimum diaSourceId.

from lsst.rsp import get_tap_service
service = get_tap_service("ssotap")

query1 = """SELECT TOP 100 sso.ssObjectId, MIN(sss.diaSourceId) AS min_diaSourceId
           FROM dp03_catalogs_10yr.SSObject AS sso
           JOIN dp03_catalogs_10yr.SSSource AS sss ON sso.ssObjectId = sss.ssObjectId
           GROUP BY sso.ssObjectId
           ORDER BY sso.ssObjectId ASC"""
results1 = service.search(query1)

string_list = "(" + ", ".join(str(value) for value in results1['min_diaSourceId']) + ")"

query2 = """SELECT ssObjectId, diaSourceId AS min_diaSourceId, heliocentricDist
           FROM dp03_catalogs_10yr.SSSource
           WHERE diaSourceId IN {}
           ORDER BY ssObjectId ASC""".format(string_list)
results2 = service.search(query2)

results2

If this solves your issue, can you mark this reply post as the solution?

If not, please do respond to let us know why and further clarify your end goals. Either way, I’d be interested to hear the motivation behind only wanting data for SSSource records with the minimum value of the diaSource identifier.

EDIT: Caveat: the “WHERE … IN …” statement can only be passed up to about 50000 elements, so this will not work on the entire SSObject table at once. The recommendation is to start by sub-selecting only SSObjects of interest (e.g., by population).

Hi @MelissaGraham , thank you so much for the extremely helpful reply! The suggestion above works perfectly! Just to provide a bit more background, the reason I am interested in selecting one row for each object is that I am working on a project to assess the differences between the predicted Keplerian velocity and the “measured” (simulated) velocities of a wide range of object types, with particular interest in identifying technosignatures and other anomalies from their velocity / acceleration behavior. I wanted to try and select one of each of the full sample of objects to try and make sure I captured all of the unusual classes of objects in the simulation (injected technosignatures, ISOs, etc.), and the reason I chose to select based on minimum diaSource identifier is that that seemed like a way to get a somewhat more random sample than choosing based on e.g. minimum heliocentricDist (or similar), if that makes sense. Let me know if you have any questions and I’m happy to clarify further!

With that in mind, is there a way to implement a query such as the one you have above such that I could retrieve all 4.4 million objects? (maybe I could select batches of 50,000 ssObjectIds, and stitch them together – or perhaps there is a more straightforward approach)? If not I totally understand! Again many, many thanks for your help on this and for your time, I appreciate it so much!!

Hi Ellie thanks for clarifying.

If I understand right, the goal is to do an analysis on one random observation per SSObject. Choosing the observation associated with the minimum diaSourceId might be ok, but this won’t be random: I think diaSourceId is sequential so it would be the first observation for each SSObject. The analysis might benefit from doing the comparative analysis of velocities on the best-measured observation, for example.

Yes, you’re right that analysis in batches is a good approach to this. It’s best to develop and test an analysis process with a relatively small part of the catalog, confirm it works, and then scale up to the full catalog.

Be aware there is a known DP0.3 issue (bottom of the page) with heliocentric positions and velocities, it sounds like this might affect your project.

I’m going to mark my first response as the solution for this topic, because it sounds like it did work for your purposes. But feel free to open a new topic with new questions any time.

Hi Melissa,

Yes, that is right! That is good to know that the diaSourceId is sequential, thanks for the advice on that, much appreciated :slight_smile: Sounds like batching is the right way to go, then! Yes, I am familiar with the issues with the heliocentric positions/velocities but thank you for the reminder. Again, many thanks for your time and help on this – I so appreciate it!

Cheers,
Ellie

1 Like

Our pleasure Ellie, thanks for using the Rubin’s Data Preview 0.3!

1 Like