How do get tract, patch from queryDatasets of calexp

  1. How do I get tract information from a queryDatasets?

If I

for data_id in butler.registry.queryDataIds(
    ["tract", "visit", "detector"],
    instrument="LSSTCam-imSim",
    datasets="visitSummary",
    collections=collection,
    where="visit=250874",
):

The results I get back (the detectors for that visit) all have the tract as part of the dataId.

But if I run

for data_ref in butler.registry.queryDatasets(
    datasetType="calexp",
    collections=collection,
    instrument="LSSTCam-imSim",
    where=f"tract={tract} and skymap='DC2'",
):

the data_ref.dataId don’t have the tract information. And neither of the following provide it:

data_ref.expanded(data_ref.dataId).full
butler.registry.expandDataId(data_ref.dataId).full
  1. What is queryDataIds doing that means that it can associate the tract information? E.g., is it querying a different datasetType?
1 Like

Hi @mwv , I’d like to try and help out (I’m still learning to use the Butler too).

I think the issue stems from the queryDatasets command specifying datasetType as calexp, because tract is not a “key” or “dimension” for the calexp DatasetType. So tract is not expected to be part of the dataId when calexp is specified.

In contrast, the queryDataIds command can be used to query for data IDs independent of any dataset. So I think that’s why it can return the tract to you, but this is where my knowledge becomes incomplete.

This Butler documentation webpage might be useful here:
When should I use each of the query methods/commands?

Perhaps this is a tangential question, I’m not sure what datasets="visitSummary" means – how did you know to specify that? I assume from the presence of instrument="LSSTCam-imSim" that you are using the DP0 data set at the IDF? But "visitSummary" does not appear as one of the DatasetTypes available with the following command:

for x in sorted(registry.queryDatasetTypes()):
    print(x)

Let me know if the above is helpful, or if we should bring in a Butler expert here!

Hi @MelissaGraham . Thanks, yes, this was after reading the Butler middleware FAQ. I would definitely appreciate an answer from a Butler expert.

Your summary was my assumption, neither “calexp” nor “visitSummary” know about tracts. But what is queryDataIds looking at to get the extra information?

Oh, and I get the same results with datasetType=“visitSummary” as datasetType=“calexp”. I didn’t mean to confuse the issue by using different ones in my example.

I was using the visitSummary as detailed in @lskelvin 's Merian write up at:

I had been running at NERSC. But here’s the complete reproduce for DP0 on the RSP:

Running latest weekly w_2022_15

from lsst.daf.butler import Butler

repo = 's3://butler-us-central1-dp01'
collection = '2.2i/runs/DP0.1'
butler = Butler(repo, collections=collection)

for data_id in butler.registry.queryDataIds(
    ["tract", "visit", "detector"],
    instrument="LSSTCam-imSim",
    datasets="calexp",
    collections=collection,
).limit(10):
    print(data_id)

i = 0
for data_ref in butler.registry.queryDatasets(
    instrument="LSSTCam-imSim",
    datasetType="calexp",
    collections=collection,
    where="tract=4848 and skymap='DC2'",
):
    i = i + 1
    if i > 10:
        break
    print(data_ref.dataId.full)

The spatial relationships are actually stored in the database as relationships between the abstract dimensions (data ID keys), like visit, tract, and patch; the relationships for datasets like “calexp” follow directly from that.

When you use queryDataIds, you can say exactly which dimensions you want those data IDs to have, and when it sees more than one spatial dimension, it’ll automatically join on those (more precisely, it will join on the most fine-grained spatial relationship).

When you use queryDatasets, the data IDs you get back always just use the dimensions of the dataset type you are querying for, though the query itself can join to others (if e.g. they are referenced in a WHERE expression).

I suspect that doesn’t tell you how to do what you actually want to do, but I’m hoping it might help refine the question.

3 Likes

The answer to your question (2) is that the queryDataIds call that you posted is specifically including the visitSummary dataset type via the datasets= keyword argument. The calexps are created before the skymap, and thus aren’t associated with tract/patch info, while visitSummary is a table created by an afterburner task that (I think) is run near the end of processing (i. e., after coadds, etc.).

If you didn’t know about this initially, and wanted to ask the butler “which dataset types have tract as a dimension?”, you could use a command-line query like this:

butler query-dataset-types --verbose $REPO | grep tract

(This could also be done via butler.registry.queryDatasetTypesin a notebook.)

Hope this helps to clarify!

Thanks, @jbosch . That’s really helpful in improving my understand of what’s going on and what I should expect.

So if I just ask for visit and detector, I don’t get tract:
(this is true for either “calexp” or “visitSummary”)

for data_id in butler.registry.queryDataIds(
    ["visit", "detector"],
    instrument="LSSTCam-imSim",
    datasets="calexp",
    collections=collection,
).limit(10):
    print(data_id.full)

produces

{band: 'g', instrument: 'LSSTCam-imSim', detector: 139, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 95, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 107, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 103, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 96, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 92, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 148, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 181, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 147, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', detector: 114, physical_filter: 'g_sim_1.4', visit_system: 1, visit: 242014}

but if I asked for tract as well, queryDataIds will go get it for me:

for data_id in butler.registry.queryDataIds(
    ["tract", "visit", "detector"],
    instrument="LSSTCam-imSim",
    datasets="calexp",
    collections=collection,
).limit(10):
    print(data_id.full)

produces

{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', detector: 139, physical_filter: 'g_sim_1.4', tract: 5062, visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', detector: 139, physical_filter: 'g_sim_1.4', tract: 5063, visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', detector: 139, physical_filter: 'g_sim_1.4', tract: 5062, visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', detector: 139, physical_filter: 'g_sim_1.4', tract: 5063, visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', detector: 95, physical_filter: 'g_sim_1.4', tract: 5063, visit_system: 1, visit: 242014}
{band: 'g', instrument: 'LSSTCam-imSim', skymap: 'DC2', detector: 95, physical_filter: 'g_sim_1.4', tract: 5063, visit_system: 1, visit: 242014}

@jbosch Just for my curiousity: For the second case (of calling for [“tract”, “visit”, “detector”], is it the join that produces the duplicates? And is this join also related to why limit(10) didn’t return 10 things?

@jeffcarlin Thanks for the reminder to read the output of queryDatasetTypes more carefully:

list(butler.registry.queryDatasetTypes(["calexp", "visitSummary"]))

produces

[DatasetType('calexp', {band, instrument, detector, physical_filter, visit_system, visit}, ExposureF),
 DatasetType('visitSummary', {band, instrument, physical_filter, visit_system, visit}, ExposureCatalog)]

Ah, interesting. Since tract isn’t a dimension of calexps, I wouldn’t have expected queries to be able to return that in the dataID. Glad to have learned about this feature of queryDataIds!

Yes, I think so. It’s not the only way to get duplicates, but the spatial joins are currently implemented by having an overlap table between each spatial dimension and a common “skypix” tiling (pretty much always configured to be HTM7). So there are really two joins here under the hood, and duplicates where some tract overlapped multiple HTM7 tiles that also overlapped the same visit+detector combinations.

Yes, though I didn’t realize this was the behavior until you pointed it out: we also do post-query filtering in Python to throw away the tract-visit+detector combinations that don’t actually overlap (but do have an HTM7 tile in common). And I bet that’s firing after the LIMIT 10 is applied in the database.

2 Likes