Database ingest of exposure metadata

As part of the verification datasets effort, several of us have been looking into ingesting the resulting catalogs into a database for easier access from our various analysis codes. I was able to do this very successfully with ingestCatalogTask in the daf_ingest package. The main problem I had was that the task only imports the columns in the catalog, and does not import any CCD-level or exposure-level information. In my case it was visitid and ccdnum that I needed most, but there are lots of header parameters that we might also want to use. I was able to work around this by adding new columns for visitid and ccdnum to the output catalog, then importing, but this seems like a hack.
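
The workaround looks roughly like this; here a numpy structured array stands in for the afw table, and the visitid/ccdnum values are hypothetical:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy stand-in for a per-CCD source catalog (real catalogs are afw tables).
catalog = np.array([(1, 10.5), (2, 11.2)],
                   dtype=[("id", "i8"), ("flux", "f8")])

# The hack: append constant-valued columns identifying the exposure/CCD,
# so the ingest task picks them up like any other measurement column.
visitid, ccdnum = 289697, 33  # hypothetical identifiers
catalog = rfn.append_fields(
    catalog,
    ["visitid", "ccdnum"],
    [np.full(len(catalog), visitid), np.full(len(catalog), ccdnum)],
    usemask=False,
)
```

Every row now carries the same visitid/ccdnum pair, which is exactly the duplication that feels like a hack.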

Is there a more appropriate short- or medium-term solution for this? Is the plan for this task to be expanded to create a new table for ccds/exposures, or is there another task that will handle this?

This used to be handled by ingestProcessed, which created a separate exposure metadata table in the database that could be linked to the catalogs through shared identifiers.

I’m not certain that it still works, or if it has been superseded by ingestImages. Someone more knowledgeable than I should weigh in on that. But the overall flow of having two separate tasks for this is desirable and intended.

One problem seems to be that there are quantities which are constant across all rows in an afw table, stored e.g. in the FITS header of its file representation, which ingestCatalogTask does not load into database columns. So, even if there is a separate metadata table for CCDs or exposures, you may not be able to join between say sources and their originating CCDs, because the information required for the join may not be in source table columns. Another problem seems to be that you don’t know how to create an exposure or CCD metadata database table.

Is that a good summary?

For the first issue, I was assuming that the pipelines would place at least foreign keys (e.g. columns that identify the originating exposure / CCD of a source) into actual columns, even if the value is constant across all source measurements of a particular granularity. If the pipelines don’t already do that, I guess you would like daf_ingest to deal with afw table metadata available via getMetadata()? I think that’s a reasonable request, but I’m not totally sure how to deal with the corresponding daf::base::PropertyList. Would it be enough to make the user responsible for specifying a list of desired metadata key names? The task might then complain if any of them don’t exist (alternatively: set the corresponding database value to NULL), and might error out if a key maps to an array of values or a complex type (i.e., an lsst::daf::base::Persistable).
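
The proposed behavior (user-specified keys, NULL for missing keys, an error for array-valued or complex entries) could be sketched as follows; a plain dict stands in for the lsst.daf.base.PropertyList returned by getMetadata(), and all names here are made up:

```python
# Hypothetical, config-driven metadata handling for daf_ingest:
# the user lists the metadata keys to ingest; a missing key maps to NULL,
# and an array-valued or complex entry raises an error.

def extract_metadata_columns(metadata, wanted_keys):
    """Return {key: scalar-or-None} for the requested metadata keys."""
    columns = {}
    for key in wanted_keys:
        if key not in metadata:
            columns[key] = None  # becomes NULL in the database
            continue
        value = metadata[key]
        if isinstance(value, (list, tuple)) or not isinstance(
                value, (int, float, str, bool)):
            raise TypeError(
                "metadata key %r maps to an array or complex type" % key)
        columns[key] = value
    return columns

meta = {"VISIT": 289697, "CCDNUM": 33, "AIRMASS": 1.12}
print(extract_metadata_columns(meta, ["VISIT", "CCDNUM", "FILTER"]))
# {'VISIT': 289697, 'CCDNUM': 33, 'FILTER': None}
```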

For the second concern, ingestProcessed should still work (the ingestImages task from pipe_tasks appears to be for something else). I haven’t heard complaints of it not working, though that might just be because nobody is using it. My medium term plan is to move it from datarel to daf_ingest. That script currently extracts and derives information from on-disk FITS image headers. I was also planning to change it to consume metadata from in-memory Exposure objects instead, so that it can be called either after a pipeline run or as part of pipeline execution (avoiding disk I/O for headers).

That does summarize the problem well, and I agree that splitting it into two issues is a good way to think about it.

For the first problem of creating columns for secondary keys, it sounds like your intention was similar to my solution (adding visitid, ccdnum columns to the FITS files even though the values were the same for all entries). I have no objection to this route, but it seems like it might require a Policy Decision (RFC?). Implementation in the tasks would be trivial if that’s the decision.

On the second issue of a table for exposures, I think separating this into a different step from loading catalogs makes a lot of sense. This part is not as high a priority for me, but I will look into ingestProcessed. Thanks.

I’m not against repeating values many times in general, but we should be aware that this goes against the idea of “database normalization”.

I think keeping metadata in the header is generally a good thing, but somehow that information needs to be passed into the database (doing this at the ingest step makes more sense to me than repeating everything in the columns of the FITS files). One way to solve this in the database is to set up two related tables: one with the columnar data from the FITS tables, and a second with the metadata (one entry per visit/ccd file). A foreign-key column would need to be added to the first table that relates each row to the appropriate entry in the second table (i.e. the visit/ccd file it came from). To get the information you need, you just use a join between the two tables. I think this is pretty standard, wouldn’t need any real tweaking or configuring (i.e. just put all the metadata in the second table), and would solve the issues we are having.
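
The two-table layout can be illustrated with SQLite; the table and column names here are invented for the example:

```python
import sqlite3

# Minimal sketch of the two-table layout: one row per visit/ccd in the
# metadata table, and a foreign key from each source row back to it.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ccd_metadata (
        metadataID INTEGER PRIMARY KEY,
        visitid    INTEGER,
        ccdnum     INTEGER,
        airmass    REAL
    );
    CREATE TABLE source (
        id         INTEGER PRIMARY KEY,
        flux       REAL,
        metadataID INTEGER REFERENCES ccd_metadata(metadataID)
    );
""")
db.execute("INSERT INTO ccd_metadata VALUES (1, 289697, 33, 1.12)")
db.executemany("INSERT INTO source VALUES (?, ?, ?)",
               [(100, 10.5, 1), (101, 11.2, 1)])

# The join recovers the visit/ccd information for every source:
rows = db.execute("""
    SELECT s.id, m.visitid, m.ccdnum
    FROM source AS s JOIN ccd_metadata AS m USING (metadataID)
    ORDER BY s.id
""").fetchall()
print(rows)  # [(100, 289697, 33), (101, 289697, 33)]
```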

Our notional schema sometimes denormalizes for performance, which is more acceptable because most of our catalogs are write-once, read-only.

In this case, though, we’re talking about inserting the foreign keys necessary to join against the exposure metadata table loaded by ingestProcessed. While all the sources for a given exposure share the same ids, all the sources for all exposures get combined into the same result table, so those columns have varying values.

In fact we often have the foreign key already in the table since the source’s unique id is usually derived from an exposure id. So, if we really care about space – which we may in production – we can define stored functions to extract the appropriate key fields from the source id. But for now, I think it is often more convenient to split those out into separate columns, which is what Serge is talking about above.
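
The idea of extracting key fields from a packed source id looks something like the following; the bit layout here (16 bits for a per-CCD source counter) is made up for the example and is not the actual pipeline scheme:

```python
# Deriving and extracting key fields from a packed source id.
SOURCE_BITS = 16  # assumed width of the per-CCD source counter

def make_source_id(ccd_exposure_id, source_num):
    """Pack an exposure/CCD id and a per-CCD source number into one id."""
    return (ccd_exposure_id << SOURCE_BITS) | source_num

def ccd_exposure_id_of(source_id):
    """The 'stored function' a database could apply in place of an
    explicit foreign-key column."""
    return source_id >> SOURCE_BITS

sid = make_source_id(28969733, 7)
print(ccd_exposure_id_of(sid))  # 28969733
```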

It’s not totally clear to me what you are saying. Are you suggesting that there WILL be two tables?

One issue that Serge raised was how to know what columns (of duplicated data) to add to the primary table. If you create normalized table(s) then you can always put all the metadata in the second table without any concerns that you are missing something.

Yes, I’ve always been saying that there will be two tables.

The question is which metadata values are the foreign keys to add to the “fact table” to join with the “dimension table”. The ingest process needs to be told that. This is not about pre-joining or denormalizing, it’s merely about key identification.

I’m confused. As far as I understand how this works, you just add a foreign key column to the “source catalog table” that links to the primary key of the second table with the metadata. I don’t think you use any of the metadata values for the foreign keys. For example, I would think that the foreign key column in the first table could be called “metadataID” or something similar.

The foreign key could be (visitid, ccdnum), more or less. We’re perhaps overloading the term “metadata” here; we obviously don’t want to include innocuous stuff like CCD dewar temperature in every source row, but we do want to include the identifiers required to match a source with its exposure. Those identifiers could be an opaque value created at ingest (“metadataID”), or a value or values that we are already using (e.g. visitid).

Yep, I agree. But I’m kind of thinking about how this could work in the very general case and therefore make it much easier to write the code, since nothing has to be hardcoded or configured. While we don’t use the CCD dewar temperature (or related metadata), I don’t see any harm in loading everything into the second table (it will be quite small).

The general case is that information about images (dewar temperatures, etc.) is ingested into one (or more) tables, each row of which has a unique id, and information about sources is ingested into another table, each row of which has one of the following (or some combination of the three):

- the unique id from the image table (standard);
- an equivalent set of columns that forms a unique id in the image table (sometimes more convenient and readable); or
- a unique id per source derived from the unique id in the image table (requires functions to extract out the necessary parts).

I don’t think there’s much dispute about any of this. Our baseline schema uses a single unique id per image, but using a combined (visit, ccdId) key is another way of doing the same thing.

Serge is saying that he expected the pipeline task to create the latter column(s) with the foreign key. If it doesn’t, then the values that should go into those columns are presumably in the table metadata, and the ingest process can create and fill the columns if the appropriate metadata is identified via configuration.

Exactly. I was assuming the pipelines would be adding foreign key columns to the afw tables they output. Sticking with the sources-measured-on-a-CCD example, the issue seems to be that my assumption was incorrect: the pipelines output per-CCD afw tables, but because the foreign key (CCD ID) is constant across all rows in a per-CCD table, it is presumably recorded in the table metadata instead. Actually, now that I think about it, you have to identify the CCD to look up the table via the butler, so maybe it’s not even available in the metadata (i.e. it might be encoded in the afw table file name). Colin - can you confirm that it is actually present in the table metadata? Presumably, we’ll have the same issue when processing coadd patches.

Anyway I’m still not sure whether I should assign myself a ticket to support ingestion of afw table metadata (forcing the user to specify the specific keys they are interested in), or whether to file an RFC requesting that reality be changed to match my initial assumption.

The exposure metadata ingestion script (ingestProcessed) will probably ingest all exposure metadata. The main thing I’d like to see there is consistent metadata keys and value types for the exposures in a dataset. Having to do a pass over all exposures just to determine the columns to ingest is not very desirable. Size for CCD metadata shouldn’t be an issue, although I might change my mind about that if you count things like detailed PSF / WCS representations as exposure metadata.

Looking at the output catalogs both from processCcd and imageDifference, I don’t see anything like visit/ccd id information in the metadata/headers. So I think in either case we will need to change the pipeline outputs a bit.

You could get this info from the dataRef. Is that not practical?

@ktl thanks for the clarification.

@smonkewitz my personal preference would be for the pipelines not to put the visit/ccdID information repeatedly in the source catalog. You could get that information by parsing the filename or from the dataRef (as Simon suggested). Also, while having visitid and ccdid in the source table would be nice, it seems easier (to me at least) to use a more generic foreign key like “metadataID” that will always just work (but I think I’m repeating myself).

As far as I can tell all of the output catalog files (and metadata) should have the same format, so I doubt you’ll need to loop over all of them to figure out the columns to ingest.

I don’t think parsing file names is an option, as ingestion isn’t supposed to care about the details of how/where the catalogs it gets from the butler are persisted. But I can look at the contents of a dataRef and add its keys (or some user specified subset) as columns. Are dataRef key values typed, or is each value generally just a string (do I need to make users specify column types)?

+100. File name parsing is the definition of unreliable. The metadata we need should be carried within the file and not be left to the whims of the filesystem.

For DECam we have been using dataRefs like visit=12345 ccdnum=33. I wouldn’t mind if those were inserted directly as columns, but there might be other important values that aren’t in the dataRef. E.g., if something was the product of two different datasets, we might want to store identifying information about both.

A dataRef for LSST might look like visit=12345 filter=r raft=1,0 sensor=1,1. I’m not sure whether we would rather store “1,0” as a string or apply some sort of regularization to turn it into an integer. I don’t think there are any types automatically associated with dataRefs right now.
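
One possible regularization is sketched below; the encoding (raft “1,0” becomes the integer 10) is purely illustrative, not an agreed-upon scheme:

```python
# Turn LSST-style comma-pair dataRef values into integers so they can be
# stored in typed columns; other values pass through unchanged.
def regularize(value):
    if isinstance(value, str) and "," in value:
        x, y = value.split(",")
        return int(x) * 10 + int(y)  # e.g. "1,0" -> 10
    return value

dataref = {"visit": 12345, "filter": "r", "raft": "1,0", "sensor": "1,1"}
typed = {k: regularize(v) for k, v in dataref.items()}
print(typed)  # {'visit': 12345, 'filter': 'r', 'raft': 10, 'sensor': 11}
```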

One of the problems with the current script in datarel is that it does not support DECam.

Looking at the code, there are many if-blocks trying to adapt it for the lsstsim, sdss, and cfht cameras.

Actually, this seems to be a difficult problem because the metadata IS camera-specific, and so is the db schema. I was wondering what the plans are for that.