Multiple copies of dataset refs from registry.queryDatasets(...)

jchiang · July 10, 2021, 4:59pm

When using butler.registry.queryDatasets, I’m getting multiple copies of some dataset refs, e.g., when asking for calexps for a given visit. Here’s some code to reproduce using the DP0.1 data:

! eups list lsst_distrib
import lsst.daf.butler as dafButler

repo = 's3://butler-us-central1-dp01'
butler = dafButler.Butler(repo)
registry = butler.registry
collection = '2.2i/runs/DP0.1'
visit = 748908

raw_refs = list(registry.queryDatasets(datasetType='raw', visit=visit, collections=collection, findFirst=True))
print("raw dataset refs:", len(raw_refs))

calexp_refs = list(registry.queryDatasets(datasetType='calexp', visit=visit, collections=collection, findFirst=True))
print("calexp dataset refs:", len(calexp_refs))
print("unique dataset refs:", len(set(calexp_refs)))

and the output

   21.0.0-3-gc37e2ab+2186fb90a2 	w_2021_25 current setup
raw dataset refs: 189
calexp dataset refs: 404
unique dataset refs: 189

The raw data have the expected 189 refs for a full visit, but there are at least 3 copies of some of the associated calexp references. I see similar behavior for src datatypes.
Why are there multiple copies? Is there a way to have the query return only one copy?

ktl · July 11, 2021, 4:53am

The "Frequently asked questions" page in the LSST Science Pipelines documentation may help.

jchiang · July 11, 2021, 5:18pm

Thanks, K-T. Since it’s mentioned in the FAQ, I would vote for having a way to turn on deduplication explicitly when calling these functions. I’d also vote for the docstring at least mentioning that duplicated results are common for these queries. I ran into this while converting Gen2 code that iterated over the datarefs from the old butler.subset command, and was surprised to find twice as many sources detected per visit using Gen3 code.
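
In the meantime, duplicates can be dropped on the client side. The following is a minimal sketch, assuming only that dataset refs are hashable (the set(calexp_refs) call in the original post relies on the same property). The helper name unique_refs and the FakeRef stand-in class are hypothetical and are not part of lsst.daf.butler; FakeRef exists only so the example runs without the LSST stack.

```python
from dataclasses import dataclass
from typing import Hashable, Iterable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def unique_refs(refs: Iterable[T]) -> List[T]:
    """Return the items of refs with duplicates removed, keeping first-seen order.

    Hypothetical helper: dict keys are unique and preserve insertion order,
    so dict.fromkeys drops repeats without reordering the results.
    """
    return list(dict.fromkeys(refs))


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for a dataset ref (not an lsst.daf.butler class)."""
    dataset_type: str
    visit: int
    detector: int


# Simulate a query result in which some refs are returned more than once.
refs = [FakeRef("calexp", 748908, d) for d in (0, 1, 1, 2, 0, 1)]
deduped = unique_refs(refs)
print("returned refs:", len(refs))
print("unique refs:", len(deduped))
```

With a real butler, the same helper could presumably wrap the query from the original post, e.g. unique_refs(registry.queryDatasets(...)). Unlike set(...), this keeps the refs in the order the query returned them.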