Calculating exposure IDs for obs_lsst cameras and beyond

timj · January 31, 2019, 5:42pm

Whilst working on the obs_lsst metadata translators we’ve been thinking about how exposure IDs are formed. Currently in the gen2 butler, exposureId seems to be equivalent to CcdExposureId (i.e. an integer uniquely specifying data from a particular visit/exposure and CCD combination). For obs_lsst we form these IDs in a number of different ways and all of them now differ from how it’s done in other obs packages.

For this discussion I will use the term detector_exposure_id to refer to CcdExposureId and exposure_id to refer to the integer associated with the visit/exposure (but identical for all CCDs).

In the table below nnn means zero-padded integer and YMDHMSF are derived from a date of ISO format YYYY-MM-DDTHH:MM:SS.FF

Camera	exposure_id	example	detector_exposure_id	example	nbits
LSSTCam	YYYYMMDDnnnnnn	20231231000123	exposure_id+nnnn	202312310001230000	58
AuxTel	YYYYMMDDnnnnnn	20180920000065	exposure_id	20180920000065	48
TS8	YYYYMMDDHHMMSSF	201807241041568	exposure_id+nn	20180724104156804	55
Phosim	RUNNUM	204595	exposure_id+nnnn	2045950038	37
Imsim	RUNNUM	3010002	exposure_id+nnnn	30100020036	37
UCDCam	YYYYMMDDHHMMSS	20181205233148	exposure_id+n	201812052331480	51

The number of bits listed is for the maximum value of detector_exposure_id we will encounter.

Before we add obs_lsst into lsst_distrib I would like us to have one last look at how we calculate detector_exposure_id. In particular the max number of bits is much larger than the 32 that we have been using in the past (I’m about to add tests to obs_base that check this for consistency).

For LSSTCam and AuxTel the exposure_id is constructed from the day of observation and a zero-padded sequence number. Currently the sequence number reserves space for 999,999 observations a day. Given we only have 86400 seconds per day that should be changed to support 99,999. I believe that we could in theory take more than 10,000 images in a day so we can’t reserve less space than that. Using YYMMDD format instead of YYYYMMDD will also help. For the detector_exposure_id 4 digits are reserved but this should be 3 (the fourth leading zero was a spacer to make it more readable).

With these changes a detector_exposure_id of 99123199999250 (250 max detectors) requires 47 bits (45 bits in 2032). AuxTel requires 39 bits.
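As a rough check of the LSSTCam arithmetic above, here is a minimal sketch that decimal-packs YYMMDD, a 5-digit sequence number and a 3-digit detector number, then counts bits with Python's built-in int.bit_length(). The function name and argument layout are illustrative assumptions, not the obs_lsst implementation.

```python
def detector_exposure_id(yymmdd: int, seqnum: int, detector: int) -> int:
    """Decimal-pack YYMMDD + 5-digit sequence number + 3-digit detector.

    Hypothetical helper for illustration only; not the obs_lsst code.
    """
    if not 0 <= seqnum <= 99_999:
        raise ValueError("sequence number must fit in 5 decimal digits")
    if not 0 <= detector <= 999:
        raise ValueError("detector number must fit in 3 decimal digits")
    return (yymmdd * 100_000 + seqnum) * 1_000 + detector


# Last possible exposure of 2099-12-31 on detector 250.
worst_case = detector_exposure_id(991231, 99_999, 250)
print(worst_case, worst_case.bit_length())  # 99123199999250 47

# Same exposure number and detector, but at the end of 2032.
in_2032 = detector_exposure_id(321231, 99_999, 250)
print(in_2032, in_2032.bit_length())  # 32123199999250 45
```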

TS8 is currently configured as if it has no more than 99 detectors but can be read out multiple times per second. Saying that we will never write more than one exposure per second, using the 2-digit year and saying we can have 999 detectors (we are currently debating how to represent this), results in 49 bits still being required. UCDCam is similar but has only 3 detectors. TS3 data will be the same but possibly have 3 digits for detector number.

Phosim and Imsim use the run number and the detector number but we only need to use 3 digits for detector number. They more or less fit in 32 bits with space for 250 detectors.

All of these numbers exceed the 32 bits that we assume is more than enough for detector_exposure_id.

Comments welcomed. I intend to make the changes proposed above to at least help the situation.

jbosch · January 31, 2019, 6:11pm

Presently these IDs are used in the pipelines in two ways:

As random number generator seeds (we want these to be deterministic but not the same for different units of data, and these IDs achieve that).
As a component of our source IDs, which are formed by combining an autoincrement number within each exposure+detector image with the detector_exposure_id after bitshifting the latter.

The pressure to keep the number of bits small comes from the latter: we’ve thus far assumed the combination will fit in a 64-bit integer. A detector_exposure_id that occupies 58 bits definitely breaks that assumption (for cameras that observe sources - I think this argument is essentially irrelevant for test stands, unless they sometimes observe many more spots than I’d expect; in any case I can’t comment usefully on this question for test stands). Shrinking that down to 47 bits should be safe (that leaves space for 131k sources per CCD, assuming unsigned integers). But it’s close enough that I think we do want to squeeze it down to that level. I doubt we’ll actually get anywhere close to 16k sources (50 bits in detector_exposure_id, or 49 with signed integers) on a CCD, but it’s getting to the point where I might be worried about detecting on images of a globular cluster or the galactic bulge in good seeing, and that’s just not something we want to have to worry about if we don’t have to.
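The bit budget described here can be sketched as follows. This assumes the simplest possible layout (detector_exposure_id in the high bits, the per-image source counter in the low bits); the function names are hypothetical and the real pipeline code may pack things differently.

```python
def max_sources(id_bits: int, total_bits: int = 64) -> int:
    """Number of source counters that fit beside an id_bits-wide ID."""
    return 1 << (total_bits - id_bits)


def make_source_id(detector_exposure_id: int, source_index: int,
                   id_bits: int, total_bits: int = 64) -> int:
    """Shift the detector_exposure_id up and put the counter in the low bits.

    Illustrative layout only, under the assumptions stated above.
    """
    source_bits = total_bits - id_bits
    if detector_exposure_id.bit_length() > id_bits:
        raise ValueError("detector_exposure_id does not fit in id_bits")
    if not 0 <= source_index < (1 << source_bits):
        raise ValueError("too many sources for the remaining bits")
    return (detector_exposure_id << source_bits) | source_index


print(max_sources(47))                 # 131072 sources with unsigned 64-bit
print(max_sources(50))                 # 16384 sources with unsigned 64-bit
print(max_sources(49, total_bits=63))  # 16384 sources with signed 64-bit

source_id = make_source_id(99123199999250, 5, id_bits=47)
print(source_id >> 17, source_id & ((1 << 17) - 1))  # recover both parts
```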

ktl · January 31, 2019, 9:26pm

We need a few more bits as well: the DPDD §2.4 says “all IDs shall be unique across databases and database versions […] For example, DR4 and DR5 […] will share no identical Object, Source, DIAObject or DIASource IDs”

timj · January 31, 2019, 9:40pm

Switching exposure ID to seconds since epoch requires 42 bits over 100 years for 999 detectors. We effectively abandon DAYOBS+SEQNUM at that point.
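The 42-bit figure can be reproduced with a quick calculation. This sketch assumes an epoch chosen so that the counter spans exactly 100 years (not the Unix epoch) and a 3-digit decimal detector suffix; both are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_CENTURY = 36_525  # 100 Julian years of 365.25 days

# Largest seconds-since-epoch value over a 100-year span.
max_seconds = DAYS_PER_CENTURY * SECONDS_PER_DAY

# Append a 3-digit detector number (up to 999) in decimal.
max_id = max_seconds * 1_000 + 999

print(max_seconds, max_id, max_id.bit_length())
```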

For a data release you could define your own exposure ID from the data backbone file count. That would result in different answers depending on whether you processed the file before it got ingested or are using the raw raw file rather than the raw file with augmented header.

parejkoj · January 31, 2019, 10:14pm

Boost has a multiprecision mode that would allow 128-bit integers. But that seems slightly problematic. I am somewhat surprised that there isn’t a 128-bit integer standard yet at this point.

Instead of seconds since epoch, would “days since epoch”+SEQNUM be a useful balance between “human readable” and few bits? I think “days since epoch” needs 17 bits (if I did the math right).

Taking an entirely different tack, is the gain from having somewhat human readable exposure/exposure_detector/source ids, as opposed to purely bit-packed numbers (i.e. where the values only make sense in hex or after some code unpacks them) worth coming close to the edge of the int64 limit?

timj · January 31, 2019, 10:20pm

Only yesterday on Slack I solved a problem that someone was having precisely because I could work out the day of observation from the visit ID. YYMMDD really is useful for giving a sense of when the data were taken and remembering that there was some major change to the system at some point. Of course, if you have an external constraint of < N bits then all that is irrelevant.

Using day since epoch is 40 bits I think (<10,000 days, <100,000 observations per night and < 1000 detectors).
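For reference, a short sketch of the days-since-epoch arithmetic, assuming the three fields are still concatenated as decimal digits (4 digits of days, 5 of sequence number, 3 of detector). The 17-bit figure for the day count alone corresponds to allowing just under 100,000 days.

```python
# Largest values allowed by each decimal field (assumed field widths).
max_days, max_seqnum, max_detector = 9_999, 99_999, 999

# Concatenate as decimal digits: DDDD + NNNNN + ddd.
max_id = (max_days * 100_000 + max_seqnum) * 1_000 + max_detector
print(max_id, max_id.bit_length())  # 999999999999 40

# Day count on its own, if up to 99,999 days are allowed.
print((99_999).bit_length())  # 17
```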

timj · January 31, 2019, 10:21pm

ie I can fiddle with the test data and leave it roughly how it is, but the LSSTCam question is the big one.

mrawls · January 31, 2019, 10:39pm

+1 for including the date in the exposure_id, even if it might not be technically necessary, for human readability purposes

ktl · January 31, 2019, 10:41pm

A semi-serious proposal: convert to hex. 1 hex digit DR, 1 hex digit year, 1 hex digit month, 2 hex digit day, 4 hex digit exposure number, 2 hex digit detector number, 5 hex digit source id number.
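To make the layout concrete, here is a minimal sketch of that hex packing: 16 hex digits, i.e. 64 bits in total. The field names, the pack/unpack helpers and the idea of storing the year as a small offset are assumptions made for illustration.

```python
# (field name, number of hex digits), most significant field first.
FIELDS = [("dr", 1), ("year", 1), ("month", 1), ("day", 2),
          ("exposure", 4), ("detector", 2), ("source", 5)]


def pack(**values: int) -> int:
    """Pack the fields into one integer, 4 bits per hex digit."""
    packed = 0
    for name, ndigits in FIELDS:
        value = values[name]
        if not 0 <= value < 16 ** ndigits:
            raise ValueError(f"{name}={value} does not fit in {ndigits} hex digits")
        packed = (packed << (4 * ndigits)) | value
    return packed


def unpack(packed: int) -> dict:
    """Inverse of pack(): peel fields off from the least significant end."""
    out = {}
    for name, ndigits in reversed(FIELDS):
        out[name] = packed & (16 ** ndigits - 1)
        packed >>= 4 * ndigits
    return out


fields = dict(dr=4, year=1, month=12, day=31,
              exposure=9120, detector=123, source=42)
example = pack(**fields)
print(format(example, "016x"))  # 41c1f23a07b0002a
print(unpack(example) == fields)  # True
```

Each field can be read straight off the hex string, which is the point of the proposal: the date is still visible, just not in decimal.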

parejkoj · January 31, 2019, 10:41pm

But how often are we going to need to do that using the raw visit id, instead of from an interface that has already converted it into something more useful, or without having a method readily to hand that converts it into a human readable string? Would you have been able to solve the problem as easily if you’d had to call, e.g. visitId.stringify() to get the date?

Sorry, I meant 17 bits for just the days part (allowing up to <100k days).

My fear is that we’ll end up painting ourselves into a corner like SDSS did with their expectation of <10k plate IDs. That was solvable for them with a fair bit of work, and we could probably avoid it by being appropriately generous with each element of our ids, but a successful (and thus long-lived) project can easily trip over such things, and having to pack decimal values into bits often means having “empty” space that could have been used elsewhere.

parejkoj · January 31, 2019, 10:48pm

Why include the DR? That has nothing to do with the visit/detector ids. Or is that specifically for the source ID?

Also, 1 hex digit year is almost certainly too limiting: that’s only 16 years. I would certainly hope that LSST gets at least one extension beyond the first 10 years…

timj · January 31, 2019, 10:54pm

If that required me to go to a web page or do a database query I would have probably let someone else answer the question on Slack.

ktl · January 31, 2019, 10:54pm

Yes, it’s leaving room for it for the source IDs. It wouldn’t appear in exposure or exposure+detector IDs. I don’t think it would even have to appear in source IDs coming out of the pipeline tasks; the DR bits could be applied at “parquet-ification”/“DPDD-ification” time.

There are 3 bits in the day and at least a couple in the source id that could be used for future expansion beyond 16 years, but it’s rather messy. That’s part of why I said “semi-serious”.

parejkoj · January 31, 2019, 10:56pm

If we were to take this approach (which essentially removes human-readability anyway), I would go straight to bit packing instead of hex packing, as SDSS did for their objids: http://skyserver.sdss.org/dr7/en/help/docs/algorithm.asp?key=objID

timj · January 31, 2019, 10:57pm

Yes, source ID uses detector_exposure_id and that is what is driving this discussion. 128 bit source IDs would be great but that doubles our ID storage and adds complication in the C++ layer (Python doesn’t care how big our integers are). Do Oracle and MariaDB support 128 bit int columns?

I really do worry about the 16 data release problem since that really does seem to be restrictive and we really would like to work with a system that does not cause everything to break when the next 10 year survey gets funded. This is really an example of us not having debated exactly how our source IDs will be formed.

timj · January 31, 2019, 11:15pm

Forming the source ID from a detector_exposure_id could involve mangling the detector_exposure_id. Just because we write it as 2023123109120123 doesn’t mean we have to pack it in the source ID exactly as that.

I think we need 2 hex digits for the year but the detector exposure ID could take up 2+1+2+4+2=11 hex digits, which is still 42 bits I think (if we want < 64 years) and which saves us a few bits.

The above example ID becomes 0x17c1f23a07b in hex form (41 bits). (Thank goodness for Python's int.bit_length() method.)
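The conversion can be reproduced with a short sketch. It assumes the decimal ID is laid out as YYYYMMDD + 5-digit sequence number + 3-digit detector, and that the year is stored as an offset from 2000; the helper name is hypothetical.

```python
def mangle(detector_exposure_id: int) -> int:
    """Re-pack a decimal YYYYMMDDnnnnnddd ID into 2+1+2+4+2 hex digits.

    Hypothetical helper; the year is stored as (year - 2000).
    """
    text = f"{detector_exposure_id:016d}"
    year = int(text[0:4]) - 2000
    month, day = int(text[4:6]), int(text[6:8])
    seqnum, detector = int(text[8:13]), int(text[13:16])

    packed = 0
    for value, ndigits in ((year, 2), (month, 1), (day, 2),
                           (seqnum, 4), (detector, 2)):
        if not 0 <= value < 16 ** ndigits:
            raise ValueError(f"{value} does not fit in {ndigits} hex digits")
        packed = (packed << (4 * ndigits)) | value
    return packed


packed = mangle(2023123109120123)
print(hex(packed), packed.bit_length())  # 0x17c1f23a07b 41
```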

ctslater · February 1, 2019, 2:59am

Hex is pretty unpalatable. I got down to 47 bits with YYdddnnnnn, where 0<=ddd<366. (assuming we will need a new system after year 2121.)

timj · February 1, 2019, 3:03am

Although that could be solely for source IDs which are already opaque entities.

For 9936699999250 I get 44 bits (99999 exposures on day 366 of year 2099 and detector 250).

ctslater · February 1, 2019, 3:14am

47 with four-digit sensor number, 44 with three-digit sensor. I was working off of the table and hadn’t caught the change.

ktl · February 1, 2019, 4:53am

44 bits seems tight; with only a 4-bit DR identifier, that leaves 16 bits for sources.
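The YYdddnnnnn numbers and the remaining budget can be checked with a small sketch. It assumes decimal concatenation of 2-digit year, 3-digit day of year, 5-digit sequence number and a detector field of configurable width; the function name is an illustrative assumption.

```python
def yyddd_id(yy: int, day_of_year: int, seqnum: int,
             detector: int, detector_digits: int = 3) -> int:
    """Decimal-pack YY + ddd + nnnnn + detector (illustrative layout)."""
    if not 0 <= detector < 10 ** detector_digits:
        raise ValueError("detector does not fit in the detector field")
    base = (yy * 1_000 + day_of_year) * 100_000 + seqnum
    return base * 10 ** detector_digits + detector


three_digit = yyddd_id(99, 366, 99_999, 250)
four_digit = yyddd_id(99, 366, 99_999, 250, detector_digits=4)
print(three_digit, three_digit.bit_length())  # 9936699999250 44
print(four_digit, four_digit.bit_length())    # 99366999990250 47

# Bits left for the per-image source counter in a 64-bit source ID
# after a 44-bit detector_exposure_id and a 4-bit data release field.
source_bits = 64 - three_digit.bit_length() - 4
print(source_bits, 1 << source_bits)  # 16 65536
```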