Calculating exposure IDs for obs_lsst cameras and beyond

mrawls · January 31, 2019, 10:39pm

+1 for including the date in the exposure_id, even if it might not be technically necessary, for human readability purposes

ktl · January 31, 2019, 10:41pm

A semi-serious proposal: convert to hex. 1 hex digit DR, 1 hex digit year, 1 hex digit month, 2 hex digit day, 4 hex digit exposure number, 2 hex digit detector number, 5 hex digit source id number.
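
For illustration only, here is a minimal Python sketch of how the hex layout proposed above could be packed and unpacked. The field names, the helper functions `pack_hex` and `unpack_hex`, and the idea that "year" counts years from some epoch are all assumptions made for this sketch; nothing here is an existing obs_lsst API.

```python
# Field widths in hex digits, in the order listed in the post above
# (most significant first). The field names are illustrative only.
FIELDS = [
    ("dr", 1),
    ("year", 1),      # assumed to be years since some epoch
    ("month", 1),
    ("day", 2),
    ("exposure", 4),
    ("detector", 2),
    ("source", 5),
]


def pack_hex(**values):
    """Pack the fields into one integer, 4 bits per hex digit."""
    packed = 0
    for name, ndigits in FIELDS:
        value = values[name]
        if not 0 <= value < 16 ** ndigits:
            raise ValueError(f"{name}={value} does not fit in {ndigits} hex digit(s)")
        packed = (packed << (4 * ndigits)) | value
    return packed


def unpack_hex(packed):
    """Inverse of pack_hex: return a dict of field values."""
    values = {}
    for name, ndigits in reversed(FIELDS):
        values[name] = packed & (16 ** ndigits - 1)
        packed >>= 4 * ndigits
    return values


total_bits = 4 * sum(ndigits for _, ndigits in FIELDS)
example = pack_hex(dr=1, year=3, month=12, day=31,
                   exposure=9120, detector=123, source=42)
print(total_bits, f"{example:016x}")  # 64 13c1f23a07b0002a
```

The 16 hex digits add up to exactly 64 bits, and each field can be read directly off the hex string.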

parejkoj · January 31, 2019, 10:41pm

But how often are we going to need to do that using the raw visit id, instead of from an interface that has already converted it into something more useful, or without having a method readily to hand that converts it into a human readable string? Would you have been able to solve the problem as easily if you’d had to call, e.g. visitId.stringify() to get the date?

Sorry, I meant 17 bits for just the days part (allowing up to <100k days).

My fear is that we’ll end up painting ourselves into a corner like SDSS did with their expectation of <10k plate IDs. That was solvable for them with a fair bit of work, and we could probably avoid it by being appropriately generous with each element of our ids, but a successful (and thus long-lived) project can easily trip over such things, and having to pack decimal values into bits often means having “empty” space that could have been used elsewhere.

parejkoj · January 31, 2019, 10:48pm

Why include the DR? That has nothing to do with the visit/detector ids. Or is that specifically for the source ID?

Also, 1 hex digit year is almost certainly too limiting: that’s only 16 years. I would certainly hope that LSST gets at least one extension beyond the first 10 years…

timj · January 31, 2019, 10:54pm

If that required me to go to a web page or do a database query I would have probably let someone else answer the question on Slack.

ktl · January 31, 2019, 10:54pm

Yes, it’s leaving room for it for the source IDs. It wouldn’t appear in exposure or exposure+detector IDs. I don’t think it would even have to appear in source IDs coming out of the pipeline tasks; the DR bits could be applied at “parquet-ification”/“DPDD-ification” time.

There are 3 bits in the day and at least a couple in the source id that could be used for future expansion beyond 16 years, but it’s rather messy. That’s part of why I said “semi-serious”.

parejkoj · January 31, 2019, 10:56pm

If we were to take this approach (which essentially removes human-readability anyway), I would go straight to bit packing instead of hex packing, as SDSS did for their objids: http://skyserver.sdss.org/dr7/en/help/docs/algorithm.asp?key=objID

timj · January 31, 2019, 10:57pm

Yes, source ID uses detector_exposure_id and that is what is driving this discussion. 128 bit source IDs would be great but that doubles our ID storage and adds complication in the C++ layer (Python doesn’t care how big our integers are). Do Oracle and MariaDB support 128 bit int columns?

I do worry about the 16 data release problem, since that seems restrictive and we would like to work with a system that does not cause everything to break when the next 10 year survey gets funded. This is an example of us not having debated exactly how our source IDs will be formed.

timj · January 31, 2019, 11:15pm

Forming the source ID from a detector_exposure_id could involve mangling the detector_exposure_id. Just because we write it as 2023123109120123 doesn’t mean we have to pack it in the source ID exactly as that.

I think we need 2 hex digits for the year but the detector exposure ID could take up 2+1+2+4+2=11 hex digits, which is still 42 bits I think (if we want < 64 years) and which saves us a few bits.

The above example ID becomes 0x17c1f23a07b in hex form (41 bits). (Thank goodness for Python’s int.bit_length() method.)
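
The conversion above can be reproduced with a short sketch. It assumes the decimal ID is laid out as YYYYMMDD, then a 5-digit exposure number, then a 3-digit detector number, and that the two hex digits for the year count years since 2000 (inferred from 0x17 = 23). The function name `decimal_to_hex_packed` and the `year_epoch` parameter are hypothetical.

```python
def decimal_to_hex_packed(detector_exposure_id, year_epoch=2000):
    """Repack a YYYYMMDDnnnnnddd decimal ID as 2+1+2+4+2 hex digits."""
    rest, detector = divmod(detector_exposure_id, 1000)   # last 3 decimal digits
    rest, exposure = divmod(rest, 100_000)                # next 5 decimal digits
    rest, day = divmod(rest, 100)
    year, month = divmod(rest, 100)

    packed = year - year_epoch                            # 2 hex digits (top field)
    # Remaining fields as (value, width in bits), most significant first.
    for name, value, nbits in (
        ("month", month, 4),
        ("day", day, 8),
        ("exposure", exposure, 16),
        ("detector", detector, 8),
    ):
        if not 0 <= value < (1 << nbits):
            raise ValueError(f"{name}={value} does not fit in {nbits} bits")
        packed = (packed << nbits) | value
    return packed


packed = decimal_to_hex_packed(2023123109120123)
print(hex(packed), packed.bit_length())  # 0x17c1f23a07b 41
```

Note that with these widths an exposure number above 65535 or a detector number above 255 does not fit, which is why the sketch raises an error rather than silently truncating.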

ctslater · February 1, 2019, 2:59am

Hex is pretty unpalatable. I got down to 47 bits with YYdddnnnnn, where 0<=ddd<366. (assuming we will need a new system after year 2121.)

timj · February 1, 2019, 3:03am

Although that could be solely for source IDs which are already opaque entities.

For 9936699999250 I get 44 bits (99999 exposures on day 366 of year 2099 and detector 250).

ctslater · February 1, 2019, 3:14am

47 with four-digit sensor number, 44 with three-digit sensor. I was working off of the table and hadn’t caught the change.
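
The two bit counts in this exchange can be checked with a quick sketch. It assumes the decimal layout is YY, then a 3-digit day of year, then a 5-digit exposure number, then a 3- or 4-digit detector number; the function name `decimal_id` is made up for this example.

```python
def decimal_id(yy, day_of_year, exposure, detector, detector_digits=3):
    """Concatenate YY, ddd, nnnnn and the detector number as decimal digits."""
    base = (yy * 1000 + day_of_year) * 100_000 + exposure
    return base * 10 ** detector_digits + detector


three_digit = decimal_id(99, 366, 99_999, 250)
four_digit = decimal_id(99, 366, 99_999, 9999, detector_digits=4)

print(three_digit, three_digit.bit_length())  # 9936699999250 44
print(four_digit, four_digit.bit_length())    # 99366999999999 47
```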

ktl · February 1, 2019, 4:53am

44 bits seems tight; with only a 4-bit DR identifier, that leaves 16 bits for sources.

timj · February 1, 2019, 2:52pm

How do we guarantee that a source ID won’t clash with a DIAObject or Object ID? Two bits for ID type?

ktl · February 1, 2019, 3:23pm

The wording of the DPDD is slightly unclear, but as I read it, it does not mandate this separation.

ktl · February 1, 2019, 3:26pm

Also note that the DPDD specifies separate conceptual catalog entries for Source ID and “ccdVisitId”. With the proposals here, the latter could be extracted from the former using a simple UDF.
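
As a rough sketch of what such an extraction could look like, written in Python rather than as a database UDF: the layout below (4 bits of DR, 44 bits of ccdVisitId, 16 bits of per-detector source number) is only one of the splits discussed in this thread, and both function names are hypothetical.

```python
# Assumed layout, most significant bits first: [DR:4][ccdVisitId:44][source:16]
DR_BITS = 4
CCD_VISIT_BITS = 44
SOURCE_BITS = 16


def make_source_id(dr, ccd_visit_id, source_number):
    """Combine the three fields into a single 64-bit source ID."""
    for name, value, nbits in (
        ("dr", dr, DR_BITS),
        ("ccd_visit_id", ccd_visit_id, CCD_VISIT_BITS),
        ("source_number", source_number, SOURCE_BITS),
    ):
        if not 0 <= value < (1 << nbits):
            raise ValueError(f"{name}={value} does not fit in {nbits} bits")
    return (dr << (CCD_VISIT_BITS + SOURCE_BITS)) | (ccd_visit_id << SOURCE_BITS) | source_number


def ccd_visit_id_from_source_id(source_id):
    """Drop the source bits, then mask off the DR bits."""
    return (source_id >> SOURCE_BITS) & ((1 << CCD_VISIT_BITS) - 1)


sid = make_source_id(dr=3, ccd_visit_id=9936699999250, source_number=1234)
print(ccd_visit_id_from_source_id(sid))  # 9936699999250
```

The same shift-and-mask would translate directly into a one-line SQL function, provided the database treats the ID as an unsigned or wide enough integer.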

RobSeaman · February 1, 2019, 3:31pm

Presumably there are engineering requirements pertaining to this? Comments above mingle representation and numeric issues. Do the semantics require a sequential count? There is convenience in mapping the exposure ID not just to the calendar, but also the clock. The maximum number of exposures per day is constrained by exposure + readout time. If the latter will always be greater than a second, no matter how short the exposure, then 17 bits is sufficient for a daily count. The remaining 15 bits suffice for 89 years of operation. Converting a 32-bit integer into an ISO-8601 string requires computation in all cases; just start the calendar count at 0h UTC (or TAI or local time or whatever) on day 0 of the LE (LSST Era). One doubts there is a project requirement to embed knowledge of the Gregorian calendar in every ID; rather, this belongs in the representation layer. If otherwise “much larger than 32 bits”, allocate more space.
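
A minimal sketch of this 32-bit scheme, with the calendar kept in the representation layer, might look like the following. The epoch date is purely hypothetical, as are the names `pack_exposure_id` and `to_iso`; only the 17-bit and 15-bit widths come from the post above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical "day 0" of the survey era; the real epoch is not specified.
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)
SECOND_BITS = 17  # 2**17 = 131072 > 86400 seconds per day
DAY_BITS = 15     # 2**15 = 32768 days


def pack_exposure_id(day, second):
    """Pack days since EPOCH and second of day into one 32-bit integer."""
    if not 0 <= day < (1 << DAY_BITS):
        raise ValueError(f"day={day} out of range")
    if not 0 <= second < 86_400:
        raise ValueError(f"second={second} out of range")
    return (day << SECOND_BITS) | second


def to_iso(exposure_id):
    """Representation layer: turn the integer back into an ISO-8601 string."""
    day = exposure_id >> SECOND_BITS
    second = exposure_id & ((1 << SECOND_BITS) - 1)
    return (EPOCH + timedelta(days=day, seconds=second)).isoformat()


eid = pack_exposure_id(day=1, second=3600)
print(eid.bit_length() <= 32, to_iso(eid))  # True 2020-01-02T01:00:00+00:00
print((1 << DAY_BITS) / 365.25)             # about 89.7 years
```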

timj · February 1, 2019, 4:10pm

Many telescopes include the date of acquisition in the observation identifier (normally a string and conventionally called OBSID in FITS). The issue here is a unique integer identifier for every CCD exposure. Human readable integers can be useful when glancing at an error report without having to run up a tool to convert ID 1234564 to some day in 2026.

Using day since some epoch + seconds in day (we definitely can’t take more than 86400 exposures per day) + up to 999 detectors results in 41 bit detector_exposure_ids for 50 years.
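
The 41-bit figure can be checked in a couple of ways. The post does not say exactly how the three fields are combined, so this sketch tries three assumed encodings: a mixed-radix product, decimal concatenation, and independent bit fields. All variable names are illustrative.

```python
import math

N_DAYS = math.ceil(50 * 365.25)  # days in 50 years, rounded up
MAX_DAY = N_DAYS - 1             # days are counted from 0
MAX_SECOND = 86_399
MAX_DETECTOR = 999

# Mixed radix: each field scaled by exactly the range of the fields below it.
mixed_radix = (MAX_DAY * 86_400 + MAX_SECOND) * 1000 + MAX_DETECTOR

# Decimal concatenation: 5 digits of day, 5 of second, 3 of detector.
decimal_concat = int(f"{MAX_DAY:05d}{MAX_SECOND:05d}{MAX_DETECTOR:03d}")

# Separate bit fields: each field rounded up to a whole number of bits.
bit_fields = sum(v.bit_length() for v in (MAX_DAY, MAX_SECOND, MAX_DETECTOR))

print(mixed_radix.bit_length(), decimal_concat.bit_length(), bit_fields)  # 41 41 42
```

Under these assumptions both the mixed-radix and the decimal forms need 41 bits, while giving each field its own bit range costs one extra bit.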

If, on the other hand, the acquisition system kept track of how many exposures were taken and used that as the exposure_id then if we took 86400 exposures every day for 50 years that fits in 31 bits (so in reality 32 bits is more than enough in that scheme).

timj · February 1, 2019, 5:54pm

After some outside discussion on Slack with @ctslater, @jbosch, and @ktl it seems like we are edging towards:

Source IDs need 20 bits to reflect the number of sources that can be found on a single CCD.
We may want to consider relaxing the requirement that source IDs must include the data release number (which would use 5 bits to cover us for 32 releases).
The ID relating the source on this detector to the exposure ID can therefore use 44 bits.
We do not necessarily require that the detector_exposure_id described here be integrated as-is into the source ID. In particular, decoupling the two means that the detector_exposure_id can use more than 44 bits and remain human readable (as can the exposure_id), while Source IDs themselves use @ktl’s proposed hex-packed version (which covers 256 years if you drop data releases). The butler registry should have enough information in it to know what is required for uniqueness.

This all means that I think we can proceed with things more or less as they currently are: retain YYYYMMDD but drop the extra 0 from the detector number and allow a maximum of 99,999 exposures per day.
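
As a sketch of where this lands, the decimal IDs could be computed roughly as below. The function names and argument names are hypothetical and not taken from obs_lsst; the layout (YYYYMMDD, 5-digit exposure number, 3-digit detector number) follows the summary above.

```python
def exposure_id(year, month, day, seq_num):
    """YYYYMMDD followed by a 5-digit per-day exposure number."""
    if not 0 <= seq_num <= 99_999:
        raise ValueError(f"seq_num={seq_num} exceeds 99,999 exposures per day")
    return (year * 10_000 + month * 100 + day) * 100_000 + seq_num


def detector_exposure_id(exp_id, detector):
    """Exposure ID followed by a 3-digit detector number."""
    if not 0 <= detector <= 999:
        raise ValueError(f"detector={detector} does not fit in 3 digits")
    return exp_id * 1000 + detector


exp = exposure_id(2023, 12, 31, 9120)
det = detector_exposure_id(exp, 123)
worst = detector_exposure_id(exposure_id(2099, 12, 31, 99_999), 999)

print(exp, det)                  # 2023123109120 2023123109120123
print(worst.bit_length())        # 51
```

The human-readable form needs about 51 bits through 2099, which is why it only works if the source ID does not have to embed it verbatim.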

RobSeaman · February 1, 2019, 6:45pm

Some were present when the FITS convention was set. Originally called RECID, a string, this was paired with an integer RECNO. The reason I replied is that you appear to be trying to combine the semantics of both. Does the solution described in the subsequent message meet requirements? Are there requirements broader than LSST? OBSID at NOAO where it originated included fields for telescope and optionally instrument.