Duplicate forcedSourceOnDiaObjectID in DP1 (Butler Parquet, Butler SQL, HATS)

DESC has found that the HATS version of DP1 contains duplicate forcedSourceOnDiaObjectId values. The same duplicates have since been found in the Butler SQL tables and the Butler Parquet files, i.e. they do not appear to be an artifact of creating the HATS files.

e.g. forcedSourceOnDiaObjectId = 600386155389124609

has four entries in dia_object_forced_source

             diaObjectId  parentObjectId   coord_ra  coord_dec  ...  invalidPsfFlag  tract  patch  forcedSourceOnDiaObjectId
208   648368125565206530               0  38.060371   7.411335  ...           False  10463     90         600386155389124609
4529  648375547268694017               0  38.172620   7.433099  ...           False  10464     98         600386155389124609
3527  650018080201637889               0  38.189089   7.510324  ...           False  10704      0         600386155389124609
1517  650025570624602120               0  38.265854   7.446826  ...           False  10705      9         600386155389124609

and these entries are spread across four separate Parquet files:

file:///sdf/group/rubin/repo/dp1/LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z/dia_object_forced_source/10705/9/dia_object_forced_source_10705_9_lsst_cells_v1_LSSTComCam_runs_DRP_DP1_v29_0_0_DM-50260_20250419T073356Z.parq

file:///sdf/group/rubin/repo/dp1/LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z/dia_object_forced_source/10464/98/dia_object_forced_source_10464_98_lsst_cells_v1_LSSTComCam_runs_DRP_DP1_v29_0_0_DM-50260_20250419T073356Z.parq

file:///sdf/group/rubin/repo/dp1/LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z/dia_object_forced_source/10463/90/dia_object_forced_source_10463_90_lsst_cells_v1_LSSTComCam_runs_DRP_DP1_v29_0_0_DM-50260_20250419T073356Z.parq

file:///sdf/group/rubin/repo/dp1/LSSTComCam/runs/DRP/DP1/v29_0_0/DM-50260/20250419T073356Z/dia_object_forced_source/10704/0/dia_object_forced_source_10704_0_lsst_cells_v1_LSSTComCam_runs_DRP_DP1_v29_0_0_DM-50260_20250419T073356Z.parq

For now we can match on the tuple (DIAObjectID, DIASourceID), but this behavior seems unexpected and will become a significant issue for a survey covering a larger area than DP1.
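
To make the workaround concrete, here is a minimal pandas sketch using only the four example rows above. I am assuming the composite key would be (diaObjectId, forcedSourceOnDiaObjectId), since those are the column names that actually appear in the table; `is_unique_key` is a hypothetical helper, not part of any Rubin package.

```python
import pandas as pd

# The four example rows from above, reduced to the columns needed here.
rows = pd.DataFrame({
    "diaObjectId": [
        648368125565206530,
        648375547268694017,
        650018080201637889,
        650025570624602120,
    ],
    "tract": [10463, 10464, 10704, 10705],
    "patch": [90, 98, 0, 9],
    "forcedSourceOnDiaObjectId": [600386155389124609] * 4,
})


def is_unique_key(df, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()


# The ID on its own does not identify a row in this sample ...
id_alone_unique = is_unique_key(rows, ["forcedSourceOnDiaObjectId"])
# ... but paired with diaObjectId (assumed composite key) it does.
composite_unique = is_unique_key(
    rows, ["diaObjectId", "forcedSourceOnDiaObjectId"]
)
print(id_alone_unique, composite_unique)
```

This only shows that the pair is unique for this one example; whether it holds across the full table would need to be checked the same way on the real data.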

To stress, these duplicates have been found in all three places: the Butler Parquet files, the Butler SQL tables, and the HATS files.

Non-unique IDs will make database searches much more complicated. How are users to query for a specific forced source?

Using DP1 HATS with LSDB, I’ve found that ~20% of all forcedSourceOnDiaObjectId values are duplicated.
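
For anyone wanting to reproduce a number like this, a rough sketch of the counting is below. It assumes the ID column has already been loaded into pandas (how you get it out of LSDB or the Butler is not shown), and the toy IDs are made up. Note that "fraction duplicated" can be counted per distinct ID value or per row, and the two can differ a lot, so the hypothetical helper reports both.

```python
import pandas as pd


def duplicated_id_fractions(ids):
    """Return the duplicated fraction counted two ways.

    by_value: fraction of distinct ID values that occur more than once.
    by_row:   fraction of rows whose ID value occurs more than once.
    """
    counts = pd.Series(ids).value_counts()
    repeated = counts > 1
    return {
        "by_value": float(repeated.mean()),
        "by_row": float(counts[repeated].sum() / counts.sum()),
    }


# Made-up IDs for illustration: 2 and 4 are repeated, 1, 3 and 5 are not.
toy_ids = [1, 2, 2, 3, 4, 4, 4, 5]
fractions = duplicated_id_fractions(toy_ids)
print(fractions)
```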

I’m speculating, but perhaps forced sources are being computed more than once in the patch overlaps, even though the DIAObjects themselves are deduplicated. A good check would be to plot the spatial positions of the duplicates and see whether they form a grid pattern.
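
A possible starting point for that check, sketched with pandas: pull out the rows whose ID is repeated, and count how many distinct (tract, patch) cells each repeated ID lands in. The column names are the ones shown in the table earlier in the thread; `duplicate_footprint` is a hypothetical helper, and the last toy row is made up so that there is one non-duplicated ID to filter out.

```python
import pandas as pd

ID_COL = "forcedSourceOnDiaObjectId"


def duplicate_footprint(df, id_col=ID_COL):
    """Return (rows with a repeated ID, distinct tract/patch cells per ID)."""
    dup = df[df.duplicated(subset=[id_col], keep=False)]
    cells_per_id = (
        dup.drop_duplicates([id_col, "tract", "patch"])
        .groupby(id_col)
        .size()
    )
    return dup, cells_per_id


# First four rows are the example from this thread; the fifth is made up.
table = pd.DataFrame({
    "coord_ra": [38.060371, 38.172620, 38.189089, 38.265854, 38.0],
    "coord_dec": [7.411335, 7.433099, 7.510324, 7.446826, 7.4],
    "tract": [10463, 10464, 10704, 10705, 10463],
    "patch": [90, 98, 0, 9, 90],
    ID_COL: [600386155389124609] * 4 + [1],
})

dup_rows, cells_per_id = duplicate_footprint(table)
print(dup_rows[["coord_ra", "coord_dec", "tract", "patch"]])
print(cells_per_id)
```

The `coord_ra` and `coord_dec` columns of `dup_rows` are what you would pass to a scatter plot to look for the grid pattern. If most repeated IDs span more than one (tract, patch) cell, that would at least be consistent with the duplicates arising at tract or patch boundaries.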