Appropriate use of "raw" and "raw_skyTile" tables in mappers

Mappers suffer from a lot of cargo-culting, and one aspect that seems particularly susceptible is the “table” entries, which seem to take values of either “raw” or “raw_skyTile”. I think these are related to either the registry or source association, and I’m curious which datasets (if any) should have these set. I’m guessing we should have at least one of them for all datasets with data ID keys similar to those of raw data (for which registry lookup applies), and none for the rest?

raw_skyTile is an old (> 5 years?) mechanism for giving the CameraMapper some spatial awareness. I believe it was used for the Association Pipeline (ap) product. Support for it is, I think, still built into the CameraMapper, but we don’t use it (at least, it’s not populated by the ingestImages.py script). You should be able to drop it universally, and use only raw for datasets that care about visit, ccd and the like.


Paul is correct. (I think Jim can mark his response as a “best answer”?)

Follow-up question: HSC also uses raw_visit. Is this still useful?

obs_decam still uses raw_visit afaik.

The raw_visit table in the registry is built by ingestImages.py from the raw table. Currently, there’s nothing in the raw_visit table that’s not in the raw table (@rhl has pointed out that this is rather a waste of space and that the schema should be normalised), but we can’t just drop it because daf_butlerUtils is looking for it explicitly. ingestImages.py could be refactored to store visit-level metadata in the raw_visit table rather than just dumping it all in the raw table.

So, bottom line: as things stand at the moment (because of the duplication between raw and raw_visit), I don’t think we gain anything by including raw_visit in the various mappings. It doesn’t hurt to keep it, though, and keeping it may save some work if ingestImages.py is ever refactored to do things more correctly.
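To make the duplication concrete, here is a minimal sketch of deriving a visit-level table from a per-CCD raw table with plain sqlite3. This is not the actual ingestImages.py code; the column names (visit, ccd, filter, dateObs, expTime) and the build_raw_visit helper are assumptions for illustration only.

```python
import sqlite3


def build_raw_visit(conn, visit_columns=("visit", "filter", "dateObs", "expTime")):
    """Rebuild raw_visit as the distinct visit-level rows of raw.

    Hypothetical helper: the column list is an assumption, not the real schema.
    """
    cols = ", ".join(visit_columns)
    conn.execute("DROP TABLE IF EXISTS raw_visit")
    # One row per distinct combination of the visit-level columns.
    conn.execute(f"CREATE TABLE raw_visit AS SELECT DISTINCT {cols} FROM raw")
    conn.commit()


# Toy registry: two CCDs for visit 100, one CCD for visit 101.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw (visit INT, ccd INT, filter TEXT, dateObs TEXT, expTime REAL)"
)
conn.executemany(
    "INSERT INTO raw VALUES (?, ?, ?, ?, ?)",
    [
        (100, 0, "r", "2015-01-01", 30.0),
        (100, 1, "r", "2015-01-01", 30.0),
        (101, 0, "i", "2015-01-02", 45.0),
    ],
)
build_raw_visit(conn)

visit_rows = conn.execute(
    "SELECT visit, filter FROM raw_visit ORDER BY visit"
).fetchall()
```

In this toy case raw_visit ends up with one row per visit, and every column in it is also present in raw, which is the duplication described above.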

“raw_visit” was meant to be a smaller, subsetted table that enabled rapid lookups based only on visit-level information, not including CCD or other information. This is primarily useful for optimizing the finding of associated calibration information, in which case it provides a substantial speedup.

Duplication of data wasn’t a problem previously, but it could possibly be ameliorated by making “raw” into a view that joins the visit/ccd information with the “raw_visit” contents.
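A rough sketch of what the view idea could look like in SQLite, assuming a hypothetical raw_ccd table that holds only the per-CCD keys. The table and column names here are made up for illustration and do not reflect the actual registry schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Visit-level metadata is stored once per visit (assumed columns).
    CREATE TABLE raw_visit (
        visit INT PRIMARY KEY, filter TEXT, dateObs TEXT, expTime REAL
    );
    -- Hypothetical table holding only the per-CCD keys.
    CREATE TABLE raw_ccd (visit INT, ccd INT, PRIMARY KEY (visit, ccd));
    -- "raw" becomes a view, so existing queries against it keep working.
    CREATE VIEW raw AS
        SELECT c.visit, c.ccd, v.filter, v.dateObs, v.expTime
        FROM raw_ccd AS c JOIN raw_visit AS v ON c.visit = v.visit;

    INSERT INTO raw_visit VALUES (100, 'r', '2015-01-10', 30.0);
    INSERT INTO raw_ccd VALUES (100, 0);
    INSERT INTO raw_ccd VALUES (100, 1);
    """
)

# Code that queries "raw" does not need to know it is now a view.
raw_rows = conn.execute(
    "SELECT visit, ccd, filter FROM raw ORDER BY ccd"
).fetchall()
```

The trade-off is that every lookup against raw now pays for a join, so whether this is a net win would depend on how the tables are indexed and how they are queried.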

Is there any specification of the schema for these tables? When @price built them for HSC I think he just added anything that might be interesting (and then balked at adding some extra visits because the sqlite was already slow and the files were enormous).

Is this being addressed as part of obs_base?

The only columns that should be in “raw_visit” are those used to identify the visit (usually just the visit id, plus the filter for SDSS, where there is more than one filter per visit) and those used to look up calibration information, which by default is just the filter, observation time and exposure length but could include others if specified in the policy for particular calibration datasets. Anything else does not assist the Butler.
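As an illustration of why only those columns matter, here is a sketch of a calibration lookup that touches nothing but visit-level information. It is not the Butler's actual implementation: the flat table, its validStart/validEnd columns, and the find_flat helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_visit (
        visit INT PRIMARY KEY, filter TEXT, dateObs TEXT, expTime REAL
    );
    -- Hypothetical calibration registry with a validity range per entry.
    CREATE TABLE flat (
        filter TEXT, calibDate TEXT, validStart TEXT, validEnd TEXT
    );
    INSERT INTO raw_visit VALUES (100, 'r', '2015-01-10', 30.0);
    INSERT INTO raw_visit VALUES (101, 'i', '2015-02-20', 45.0);
    INSERT INTO flat VALUES ('r', '2015-01-05', '2015-01-01', '2015-01-31');
    INSERT INTO flat VALUES ('r', '2015-02-05', '2015-02-01', '2015-02-28');
    INSERT INTO flat VALUES ('i', '2015-02-05', '2015-02-01', '2015-02-28');
    """
)


def find_flat(conn, visit):
    """Return the calibDate of the flat valid for a visit, or None."""
    # Step 1: visit-level lookup; no CCD information is needed.
    row = conn.execute(
        "SELECT filter, dateObs FROM raw_visit WHERE visit = ?", (visit,)
    ).fetchone()
    if row is None:
        return None
    filter_name, date_obs = row
    # Step 2: match on filter and validity range (ISO dates sort as text).
    calib = conn.execute(
        "SELECT calibDate FROM flat "
        "WHERE filter = ? AND validStart <= ? AND validEnd >= ?",
        (filter_name, date_obs, date_obs),
    ).fetchone()
    return calib[0] if calib else None


flat_for_100 = find_flat(conn, 100)
flat_for_101 = find_flat(conn, 101)
```

Because the first query hits a table with one row per visit rather than one row per CCD, it scans far fewer rows, which is where the speedup for calibration lookups would come from.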

“raw” should include columns that might be used in identifying a dataset (i.e. used in a data id or in a path template). Again, any other metadata does not assist the Butler.

Creation of these tables was the responsibility of the genInputRegistry.py scripts in each obs_* package. My understanding is that this has been replaced by a more generic “ingest” system, so there should be no need for this to be part of obs_base.

The more generic system being in pipe_drivers (what a terrible name!)?