Whilst working on the obs_lsst metadata translators we’ve been thinking about how exposure IDs are formed. Currently in the gen2 butler, exposureId seems to be equivalent to CcdExposureId (i.e. an integer uniquely specifying data from a particular visit/exposure and CCD combination). For obs_lsst we form these IDs in a number of different ways, and all of them now differ from how it’s done in other obs packages.
For this discussion I will use the term detector_exposure_id to refer to CcdExposureId and exposure_id to refer to the integer associated with the visit/exposure (but identical for all CCDs).
In the table below, nnn means a zero-padded integer, and the YMDHMSF digits are taken from a date in the ISO format YYYY-MM-DDTHH:MM:SS.FF.
| Camera  | exposure_id     | example         | detector_exposure_id | example            | nbits |
|---------|-----------------|-----------------|----------------------|--------------------|-------|
| LSSTCam | YYYYMMDDnnnnnn  | 20231231000123  | exposure_id+nnnn     | 202312310001230000 | 58    |
| AuxTel  | YYYYMMDDnnnnnn  | 20180920000065  | exposure_id          | 20180920000065     | 48    |
| TS8     | YYYYMMDDHHMMSSF | 201807241041568 | exposure_id+nn       | 20180724104156804  | 55    |
| Phosim  | RUNNUM          | 204595          | exposure_id+nnnn     | 2045950038         | 37    |
| Imsim   | RUNNUM          | 3010002         | exposure_id+nnnn     | 30100020036        | 37    |
| UCDCam  | YYYYMMDDHHMMSS  | 20181205233148  | exposure_id+n        | 201812052331480    | 51    |
The number of bits listed is for the maximum value of detector_exposure_id we expect to encounter.
Before we add obs_lsst into lsst_distrib I would like us to have one last look at how we calculate detector_exposure_id. In particular, the maximum number of bits is much larger than the 32 that we have been using in the past (I’m about to add tests to obs_base that check this for consistency).
For LSSTCam and AuxTel the exposure_id is constructed from the day of observation and a zero-padded sequence number. Currently the sequence number reserves space for 999,999 observations a day. Given that there are only 86,400 seconds per day, that could be reduced to support 99,999. I believe that we could in theory take more than 10,000 images in a day, so we can’t reserve less space than that. Using YYMMDD format instead of YYYYMMDD will also help. For the detector_exposure_id, 4 digits are currently reserved but this should be 3 (the fourth, always-zero digit was a spacer to make the ID more readable).
With these changes a detector_exposure_id of 99123199999250 (with 250 as the maximum detector number) requires 47 bits (45 bits through 2032). AuxTel needs 39.
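As a sanity check, those bit counts can be reproduced with Python’s `int.bit_length()` (a quick sketch; the two values below are the worst cases for the YYMMDD+SEQNUM+detector packing described above):

```python
# Worst-case detector_exposure_id with a 2-digit year, 5-digit seqnum
# and 3-digit detector number, packed as decimal digits:
max_id_2099 = 99123199999250  # YYMMDD=991231, SEQNUM=99999, detector=250
max_id_2032 = 32123199999250  # same scheme at the end of 2032

print(max_id_2099.bit_length())  # 47
print(max_id_2032.bit_length())  # 45
```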
TS8 is currently configured as if it has no more than 99 detectors but can be read out multiple times per second. If instead we say that we will never write more than one exposure per second, use a 2-digit year, and allow for 999 detectors (we are currently debating how to represent this), 49 bits are still required. UCDCam is similar but has only 3 detectors. TS3 data will be the same but may need 3 digits for the detector number.
Phosim and Imsim use the run number and the detector number but we only need to use 3 digits for detector number. They more or less fit in 32 bits with space for 250 detectors.
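To illustrate, using the Imsim run number from the table and an assumed maximum detector number of 250 (packed as 3 decimal digits):

```python
# RUNNUM followed by a 3-digit detector number, packed as decimal digits:
run = 3010002
detector_exposure_id = run * 1000 + 250  # 250 = assumed maximum detector number
print(detector_exposure_id.bit_length())  # 32
```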
All of these numbers exceed the 32 bits that we assume is more than enough for detector_exposure_id.
Comments welcomed. I intend to make the changes proposed above to at least help the situation.
Presently these IDs are used in the pipelines in two ways:

1. As random number generator seeds (we want these to be deterministic but not the same for different units of data, and these IDs achieve that).
2. As a component of our source IDs, which are formed by combining an autoincrement number within each exposure+detector image with the detector_exposure_id after bit-shifting the latter.
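The second usage can be sketched as follows. This is illustrative only: the helper names and the 17-bit width for the per-image counter are assumptions for the example, not the actual pipeline values.

```python
SOURCE_BITS = 17  # illustrative: room for 2**17 = 131072 sources per CCD image

def make_source_id(detector_exposure_id: int, source_index: int) -> int:
    """Combine a detector_exposure_id with a per-image autoincrement counter."""
    assert source_index < (1 << SOURCE_BITS)
    return (detector_exposure_id << SOURCE_BITS) | source_index

def split_source_id(source_id: int) -> tuple[int, int]:
    """Recover the (detector_exposure_id, source_index) pair."""
    return source_id >> SOURCE_BITS, source_id & ((1 << SOURCE_BITS) - 1)

sid = make_source_id(99123199999250, 42)
assert split_source_id(sid) == (99123199999250, 42)
# A 47-bit detector_exposure_id plus 17 counter bits exactly fills 64 bits:
print(sid.bit_length())  # 64
```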
The pressure to keep the number of bits small comes from the latter: we have thus far assumed the combination will fit in a 64-bit integer. A detector_exposure_id that occupies 58 bits definitely breaks that assumption, at least for cameras that observe sources (I think this argument is essentially irrelevant for test stands, unless they sometimes observe many more spots than I’d expect; in any case I can’t comment usefully on test stands). Shrinking it down to 47 bits should be safe, leaving space for 131k sources per CCD assuming unsigned integers, but it’s close enough that I think we do want to squeeze it down to that level. I doubt we’ll actually get anywhere near 16k sources on a CCD (50 bits of detector_exposure_id, or 49 with signed integers), but it’s getting to the point where I might worry about detecting on images of a globular cluster or the Galactic bulge in good seeing, and that’s just not something we want to have to worry about if we don’t have to.
We need a few more bits as well: the DPDD §2.4 says “all IDs shall be unique across databases and database versions […] For example, DR4 and DR5 […] will share no identical Object, Source, DIAObject or DIASource IDs”
Switching the exposure ID to seconds since an epoch requires 42 bits to cover 100 years and 999 detectors, but we effectively abandon DAYOBS+SEQNUM at that point.
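The 42-bit figure checks out (a sketch, assuming a 100-year span and the 3-digit detector number packed in the low decimal digits):

```python
# ~3.16e9 seconds in a century; append a 3-digit detector number.
seconds_per_century = int(100 * 365.25 * 86400)
max_id = seconds_per_century * 1000 + 999
print(max_id.bit_length())  # 42
```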
For a data release you could define your own exposure ID from the data backbone file count, but that would give different answers depending on whether you processed the file before it was ingested, or used the original raw file rather than the raw file with the augmented header.
Boost has a multiprecision library that would allow 128-bit integers, but that seems slightly problematic. I am somewhat surprised that there is still no standard 128-bit integer type at this point.
Instead of seconds since epoch, would “days since epoch”+SEQNUM be a useful balance between “human readable” and few bits? I think “days since epoch” needs 17 bits (if I did the math right).
Taking an entirely different tack, is the gain from having somewhat human readable exposure/exposure_detector/source ids, as opposed to purely bit-packed numbers (i.e. where the values only make sense in hex or after some code unpacks them) worth coming close to the edge of the int64 limit?
Only yesterday on Slack I solved a problem someone was having, precisely because I could work out the day of observation from the visit ID. YYMMDD really is useful for giving a sense of when the data were taken and remembering that there was some major change to the system at some point. Of course, if you have an external constraint of < N bits then all that is irrelevant.
Using days since epoch needs 40 bits, I think (<10,000 days, <100,000 observations per night, and <1,000 detectors).
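Both the 40-bit and 17-bit figures can be checked directly (a sketch, packing the fields as decimal digits):

```python
# days-since-epoch scheme packed as decimal digits: DDDD SSSSS NNN
# (<10,000 days, <100,000 observations per night, <1,000 detectors)
max_id = int("9999" + "99999" + "999")
print(max_id.bit_length())  # 40

# the days field alone, if extended to allow <100,000 days:
print((99_999).bit_length())  # 17
```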
But how often are we going to need to do that using the raw visit id, instead of from an interface that has already converted it into something more useful, or without having a method readily to hand that converts it into a human readable string? Would you have been able to solve the problem as easily if you’d had to call, e.g. visitId.stringify() to get the date?
Sorry, I meant 17 bits for just the days part (allowing up to <100k days).
My fear is that we’ll end up painting ourselves into a corner like SDSS did with their expectation of <10k plate IDs. That was solvable for them with a fair bit of work, and we could probably avoid it by being appropriately generous with each element of our IDs, but a successful (and thus long-lived) project can easily trip over such things, and having to pack decimal values into bits often means having “empty” space that could have been used elsewhere.
Why include the DR? That has nothing to do with the visit/detector ids. Or is that specifically for the source ID?
Also, 1 hex digit year is almost certainly too limiting: that’s only 16 years. I would certainly hope that LSST gets at least one extension beyond the first 10 years…
Yes, it’s leaving room for it for the source IDs. It wouldn’t appear in exposure or exposure+detector IDs. I don’t think it would even have to appear in source IDs coming out of the pipeline tasks; the DR bits could be applied at “parquet-ification”/“DPDD-ification” time.
There are 3 bits in the day and at least a couple in the source id that could be used for future expansion beyond 16 years, but it’s rather messy. That’s part of why I said “semi-serious”.
Yes, source ID uses detector_exposure_id and that is what is driving this discussion. 128-bit source IDs would be great, but that doubles our ID storage and adds complication in the C++ layer (Python doesn’t care how big our integers are). Do Oracle and MariaDB support 128-bit int columns?
I really do worry about the 16-data-release problem, since that does seem restrictive, and we would like to work with a system that does not cause everything to break when the next 10-year survey gets funded. This is an example of us never having debated exactly how our source IDs will be formed.
Forming the source ID from a detector_exposure_id could involve mangling the detector_exposure_id. Just because we write it as 2023123109120123 doesn’t mean we have to pack it in the source ID exactly as that.
I think we need 2 hex digits for the year, but the detector_exposure_id could then take up 2+1+2+4+2 = 11 hex digits, which I think is still only 42 bits (if we want < 64 years) and which saves us a few bits.
The above example ID becomes 0x17c1f23a07b in hex form (41 bits). (Thank goodness for Python’s int.bit_length() method.)
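That repacking can be sketched like this (the function name is made up for illustration; the field widths are the 2+1+2+4+2 hex digits proposed above for year, month, day, seqnum and detector):

```python
def pack_hex(year: int, month: int, day: int, seqnum: int, detector: int) -> int:
    """Pack fields into 2+1+2+4+2 hex digits (8+4+8+16+8 bits)."""
    return (year << 36) | (month << 32) | (day << 24) | (seqnum << 8) | detector

# The example ID 2023123109120123 parses as 2023-12-31, seqnum 09120, detector 123.
packed = pack_hex(23, 12, 31, 9120, 123)
print(hex(packed))          # 0x17c1f23a07b
print(packed.bit_length())  # 41
```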