Problem with megacam dataId

boutigny · September 30, 2016, 12:44pm

This is the continuation of a discussion started on HipChat.

I ran forcedPhotCcd.py on some CFHT images and it correctly produced files in output/coadd_dir/forced, but I get an error when I try to access the resulting catalogs through the butler. For instance, if I do the following:

dataId = {'visit':visit, 'ccd':ccd, 'tract':0, 'filter':filter}
forced = butler.get("forced_src", dataId=dataId)

I get:

/home/boutigny/LSST/new/lsstsw/stack/Linux64/daf_persistence/12.0-1-gc553c11+1/python/lsst/daf/persistence/registries.pyc in lookup(self, lookupProperties, reference, dataId, **kwargs)
315 valueList.append(v)
316 cmd += " WHERE " + " AND ".join(whereList)
--> 317 c = self.conn.execute(cmd, valueList)
318 result = []
319 for row in c:

OperationalError: no such column: tract

The forced_src entry in the MegacamMapper.paf file looks OK:

forced_src: {
template: "forced/%(runId)s/%(object)s/%(date)s/%(filter)s/%(tract)d/FORCEDSRC-%(visit)d-%(ccd)02d.fits"
python: "lsst.afw.table.SourceCatalog"
persistable: "ignored"
storage: "FitsCatalogStorage"
tables: raw
tables: raw_visit

but apparently the butler is unable to reconstruct the full dataId from the partial set of keywords provided (visit, ccd, filter and tract), even though these should be sufficient.

The problem seems to be related to mixing CCD-like keywords (visit, ccd, filter) with a coadd-like keyword (tract).
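To illustrate the suspected cause, here is a minimal sketch that reproduces the same error outside the stack. It is not the butler's actual code: the table layout and the `lookup` helper are assumptions made for illustration, loosely modelled on the query construction visible in the traceback above.

```python
import sqlite3

# Hypothetical stand-in for the raw-data registry; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw (runId TEXT, object TEXT, date TEXT, "
    "filter TEXT, visit INTEGER, ccd INTEGER)"
)
conn.execute(
    "INSERT INTO raw VALUES "
    "('08BO01', 'SCL-2241_P1', '2008-09-02', 'u', 1022064, 25)"
)


def lookup(conn, wanted, data_id):
    """Look up the `wanted` columns, constraining on every key in data_id."""
    where = " AND ".join("%s = ?" % key for key in data_id)
    cmd = "SELECT DISTINCT %s FROM raw WHERE %s" % (", ".join(wanted), where)
    return conn.execute(cmd, list(data_id.values())).fetchall()


wanted = ["runId", "object", "date"]

# Only registry columns in the query: the missing keys are found.
rows = lookup(conn, wanted, {"visit": 1022064, "ccd": 25, "filter": "u"})

# Adding 'tract' puts a column that does not exist into the WHERE clause.
try:
    lookup(conn, wanted, {"visit": 1022064, "ccd": 25, "filter": "u", "tract": 0})
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(rows)
print(error)
```

In this toy version, every key of the dataId ends up in the WHERE clause, so a key that is valid for the dataset but absent from the registry table is enough to make the whole lookup fail.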

Following @hsinfang’s suggestion, I tried to provide the complete dataId:

dataId = {'runId':'08BO01', 'object':'SCL-2241_P1','date':'2008-09-02', 'visit':1022064, 'ccd':25, 'tract':0, 'filter':'u'}

and it worked without problem.

This is very annoying, as one of the functions of the butler is to determine the missing keywords automatically. Doing that by hand is not practical.

I have the feeling that it may be possible to implement a trick in megacamMapper.py to force the system to consider “tract” as a valid keyword for forced_src, but I don’t know how to do this in practice…

price · September 30, 2016, 1:28pm

That’s right. The keyword “tract” isn’t in the registry, and so it breaks. Maybe we should strip out “tract” and “patch” from queries, and reserve them solely for coadd-like data. What do you think, @ktl and @natepease?

boutigny · September 30, 2016, 1:32pm

But “tract” is not optional for forced_src. If we have several tracts, there should be as many forced_src datasets.

jbosch · September 30, 2016, 3:14pm

I think @price is proposing stripping “tract” out somewhere in the butler implementation, not removing it from the forced_src dataset. I think this will require some kind of fix to the butler.

As a possibly simpler workaround in the meantime, I believe you don’t have to specify a complete data ID to avoid the sort of registry lookups that are causing problems here; you just have to include all of the data ID keys that are needed to fill out the template. Of course, that will still probably require looking up some keys you shouldn’t have to.

boutigny · September 30, 2016, 3:28pm

Yes, the subset of keys present in the template is enough to make it work.
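For anyone wanting to check which keys a given template needs, the following is a small sketch in plain Python. The template string is the one from the forced_src entry quoted earlier in the thread; the helper names `template_keys` and `missing_keys` are made up for this example and are not part of the butler.

```python
import re

# Template copied from the forced_src entry quoted earlier in the thread.
TEMPLATE = (
    "forced/%(runId)s/%(object)s/%(date)s/%(filter)s/%(tract)d/"
    "FORCEDSRC-%(visit)d-%(ccd)02d.fits"
)


def template_keys(template):
    """Return the %(name) keys of a template, in order of appearance."""
    return re.findall(r"%\((\w+)\)", template)


def missing_keys(template, data_id):
    """Return the template keys that data_id does not provide."""
    return [key for key in template_keys(template) if key not in data_id]


partial = {"visit": 1022064, "ccd": 25, "tract": 0, "filter": "u"}
full = dict(partial, runId="08BO01", object="SCL-2241_P1", date="2008-09-02")

print(missing_keys(TEMPLATE, partial))  # keys that still need to be supplied
print(missing_keys(TEMPLATE, full))     # nothing missing
print(TEMPLATE % full)                  # the path can now be filled in
```

With the partial dataId from the first post, the missing keys are runId, object and date, which are exactly the ones that had to be added by hand.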

price · September 30, 2016, 3:31pm

Here’s another workaround (as I suggested here):

dataRefList = list(butler.subset("raw", visit=visit, ccd=ccd))
assert len(dataRefList) == 1
dataRef = dataRefList.pop()
forced = dataRef.get("forced_src", tract=tract, immediate=True)

The idea is that you’re populating the dataId using the raw product, and using that to get the forced_src.
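The same two-step logic can be sketched without the stack. This is only an illustration under stated assumptions: `RAW_REGISTRY`, `complete_raw_id` and `forced_src_id` are hypothetical names, and the second row (ccd 26) is invented so that the lookup has something to discriminate against.

```python
# Hypothetical registry rows for raw data; the real registry is a database.
RAW_REGISTRY = [
    {"runId": "08BO01", "object": "SCL-2241_P1", "date": "2008-09-02",
     "filter": "u", "visit": 1022064, "ccd": 25},
    {"runId": "08BO01", "object": "SCL-2241_P1", "date": "2008-09-02",
     "filter": "u", "visit": 1022064, "ccd": 26},
]


def complete_raw_id(registry, **partial):
    """Step 1: return the single raw data ID that matches the given keys."""
    matches = [row for row in registry
               if all(row[key] == value for key, value in partial.items())]
    if len(matches) != 1:
        raise LookupError("expected one match, found %d" % len(matches))
    return dict(matches[0])


def forced_src_id(registry, tract, **partial):
    """Step 2: add 'tract' only after the registry lookup is done."""
    data_id = complete_raw_id(registry, **partial)
    data_id["tract"] = tract
    return data_id


data_id = forced_src_id(RAW_REGISTRY, tract=0, visit=1022064, ccd=25)
print(data_id)
```

The point of the split is that “tract” never takes part in the registry lookup; it is attached afterwards, once the CCD-level keys are already known.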

nchotard · October 11, 2016, 2:37pm

Hello,

I have a question related to this discussion. Given a butler and a catalog name (‘forced_src’ or ‘deepCoadd_forced_src’, for instance), I would like to get the full list of available dataIds. The idea is to loop over this list instead of having to know the exact set of dataIds in advance. Is there a general way to get this list that does not depend on the input catalog?

I found a way to do it for the ‘forced_src’ catalog, but I cannot do it for the others using the same trick:

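catalog = "forced_src"
keys = butler.getKeys(catalog)
keys.pop('tract')  # queryMetadata fails otherwise: 'tract' is not in the registry
keys = list(keys)
# Build one dataId per registry row, adding the tract by hand
dataids = [dict(zip(keys, v), tract=0) for v in butler.queryMetadata(catalog, format=keys)]
# Keep only the dataIds for which the catalog actually exists
dataids = [dataid for dataid in dataids if butler.datasetExists(catalog, dataId=dataid)]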

Thanks

price · October 11, 2016, 3:13pm

The butler (actually, the mapper that the butler uses) is often backed by a “registry” of metadata for the raw data, which allows us to quickly identify CCDs to be processed. There is (currently) no registry of coadd data, though, so it’s not possible to identify tracts and patches to be processed through the butler. What you can do instead is retrieve the deepCoadd_skyMap and iterate over that:

skyMap = butler.get("deepCoadd_skyMap")
dataIds = (dict(tract=tract.getId(), patch="%d,%d" % patch.getIndex(), filter="HSC-I") for tract in skyMap for patch in tract)
dataIds = [dataId for dataId in dataIds if butler.datasetExists("deepCoadd_forced_src", dataId)]

nchotard · October 18, 2016, 1:29pm

Thanks for your help! That helped a lot.

Nicolas

jbosch · November 9, 2016, 7:04pm

post deleted; @jbosch was not reading the code he was commenting on correctly

ctslater · November 28, 2016, 8:07pm

To close the loop, this issue was fixed in DM-8230.