Removing an exposure dimension record after ingestion

stevenstetzler · November 21, 2022, 10:47pm

I am using weekly w.2022.45 of the science pipelines. I’d like to remove an exposure dimension record from the Butler registry. I am ingesting a second version of a DECam FITS file in which some of the metadata (at least the exposure timespan) has changed, but the exposure number is the same. This produces a conflict when ingesting with butler ingest-raws:

lsst.ingest WARNING: Exposure DECam:ct4m20140331t051219 could not be registered: Conflict in sync for table exposure on column(s) timespan: Timespan(begin=astropy.time.Time('2014-03-31 05:12:54.091401', scale='tai', format='iso'), end=astropy.time.Time('2014-03-31 05:14:24.091401', scale='tai', format='iso')) != Timespan(begin=astropy.time.Time('2014-03-31 05:12:54.091401', scale='tai', format='iso'), end=astropy.time.Time('2014-03-31 05:15:18.000000', scale='tai', format='iso')).

I can’t find a good way to remove a dimension record from the registry: there is no butler remove-dimension-record command and, in Python, no butler.registry.removeDimensionRecord(). How should I go about this?

As an aside, I am finding the ingestion code extremely opaque. It is hard to find and search through it for relevant configs that would set e.g. update=True in butler.Registry.syncDimensionData, which I think would solve my issue.

The long log hints that I should start looking at line 1182 of ingest.py:

WARNING 2022-11-21T13:54:40.498-08:00 lsst.ingest ()(ingest.py:1182) - Exposure DECam:ct4m20140331t051219 could not be registered: Conflict in sync for table exposure on column(s) timespan: Timespan(begin=astropy.time.Time('2014-03-31 05:12:54.091401', scale='tai', format='iso'), end=astropy.time.Time('2014-03-31 05:14:24.091401', scale='tai', format='iso')) != Timespan(begin=astropy.time.Time('2014-03-31 05:12:54.091401', scale='tai', format='iso'), end=astropy.time.Time('2014-03-31 05:15:18.000000', scale='tai', format='iso')).

But I can’t find a relevant ingest.py (or lsst.ingest package) on GitHub. Additionally, I can’t find any source code for DecamParseTask (github and docs) which I was hoping would give me insight into how the ingestion code is working.

Does anyone have tips or links with a high level overview of how fits ingestion (through dataset registration with the registry) is done in the science pipelines?

Thanks.

timj · November 21, 2022, 11:04pm

Do you know how this is possible?

It’s in obs_base:

github.com

lsst/obs_base/blob/main/python/lsst/obs/base/ingest.py#L1254

      
        
def run(
    self,
    files: Iterable[ResourcePathExpression],
    *,
    pool: Optional[PoolType] = None,
    processes: int = 1,
    run: Optional[str] = None,
    file_filter: Union[str, re.Pattern] = r"\.fit[s]?\b",
    group_files: bool = True,
    skip_existing_exposures: bool = False,
    update_exposure_records: bool = False,
    track_file_attrs: bool = True,
) -> List[DatasetRef]:
    """Ingest files into a Butler data repository.

    This creates any new exposure or visit Dimension entries needed to
    identify the ingested files, creates new Dataset entries in the
    Registry and finally ingests the files themselves into the Datastore.
    Any needed instrument, detector, and physical_filter Dimension entries
    must exist in the Registry before `run` is called.

The problem is that we don’t make that update option available in the butler ingest-raws command at the moment, because it seemed like a very dangerous thing to do and we wanted to have a think before adding it. The current advice is to run the Task directly on the file whose existing definition you want to override. You can see the code here:

github.com

lsst/obs_base/blob/main/python/lsst/obs/base/script/ingestRaws.py#L91-L94

      
        
ingester = TaskClass(config=ingestConfig, butler=butler)
ingester.run(
    locations, run=output_run, processes=processes, file_filter=regex, track_file_attrs=track_file_attrs
)
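To make the conflict-versus-update behaviour discussed here concrete, the following is a toy sketch in plain Python. It is not the Butler implementation: sync_record, ConflictError, the dictionary-backed table and the timespan_end column name are all hypothetical and only mimic the idea of rejecting a differing record unless an update flag is set.

```python
class ConflictError(Exception):
    """Hypothetical error raised when a new record disagrees with the stored one."""


def sync_record(table, key, record, update=False):
    """Insert `record` under `key`, or reconcile it with the stored record.

    Returns True if the table was modified. This is a toy model only.
    """
    existing = table.get(key)
    if existing is None:
        table[key] = dict(record)
        return True
    # Columns whose stored value differs from the incoming value.
    changed = sorted(k for k in record if existing.get(k) != record[k])
    if not changed:
        return False
    if not update:
        raise ConflictError(f"Conflict in sync on column(s) {', '.join(changed)}")
    existing.update(record)
    return True


exposures = {}
key = "ct4m20140331t051219"
old = {"timespan_end": "2014-03-31 05:14:24.091401"}
new = {"timespan_end": "2014-03-31 05:15:18.000000"}

sync_record(exposures, key, old)
try:
    # Without update=True the differing timespan is rejected.
    sync_record(exposures, key, new)
    conflict_message = None
except ConflictError as err:
    conflict_message = str(err)

# With update=True the stored record is overwritten instead.
sync_record(exposures, key, new, update=True)
```

The default of rejecting the new record mirrors why the update option is treated as dangerous: overwriting silently would hide the fact that two files disagree about the same exposure.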

DecamParseTask is Gen2 code that has now been deleted. The metadata translation system changed for Gen3, and the DECam translator is at:

github.com

lsst/astro_metadata_translator/blob/main/python/astro_metadata_translator/translators/decam.py

# This file is part of astro_metadata_translator.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (http://www.lsst.org).
# See the LICENSE file at the top-level directory of this distribution
# for details of code ownership.
#
# Use of this source code is governed by a 3-clause BSD-style
# license that can be found in the LICENSE file.

"""Metadata translation code for DECam FITS headers"""

from __future__ import annotations

__all__ = ("DecamTranslator",)

import logging
import posixpath
import re


If the second file has an incorrect header for some reason, it is possible to specify a header override file for it. You can also run the astrometadata write-sidecar command and then edit the translated content before running ingest. It all partly depends on knowing which file had the right header information.

It’s all generic code. We use astro_metadata_translator to extract the relevant exposure record information and then create the exposure records and ingest the files into butler.

stevenstetzler · November 21, 2022, 11:55pm

Thanks, Tim, for the helpful information. This is exactly what I needed to do. Here’s what I ran:

from lsst.daf.butler import Butler
from lsst.obs.base import ingest
from lsst.pipe.base.configOverrides import ConfigOverrides

# Open the repository read-write so the exposure record can be changed.
_butler = Butler("./repo", writeable=True)
config = ingest.RawIngestConfig()
configOverrides = ConfigOverrides()
configOverrides.applyTo(config)  # no overrides are set, so the config keeps its defaults
task = ingest.RawIngestTask(config=config, butler=_butler)
# update_exposure_records=True updates the conflicting exposure record instead of rejecting the file.
task.run(
    ["./data/images/mar31/object/c4d_140331_051219_ori.fits.fz"],
    run="mar31/raw/object",
    skip_existing_exposures=True,
    update_exposure_records=True,
)

One file I downloaded from the NOIRLab image archive was corrupted/unreadable. They re-ingested the original raw data into the archive under a different name with new tooling (updating some headers) and deleted the old file. It looks like this updated the exposure timespan. The diff on the fits headers is (diff old new):

< DATE-OBS= '2014-03-31T05:12:19.091401'  /  UTC epoch                            
---
> DATE-OBS= '2014-03-31T05:12:19.091401' / UTC epoch                              
36c36
< DTPI    = 'Heinze            '  /  Principal Investigator                       
---
> DTPI    = 'Heinze  '           / Principal Investigator                         
131,139c131,168
< CHECKSUM= 'nDg2q9Z2nAf2n9Z2'   / HDU checksum updated 2022-11-16T12:09:25       
< DATASUM = '0         '         / data unit checksum updated 2022-11-16T12:09:25 
< DTSITE  = 'ct                '  /  observatory location                         
< DTTELESC= 'ct4m              '  /  telescope identifier                         
< DTINSTRU= 'decam             '  /  instrument identifier                        
< DTCALDAT= '2014-03-30        '  /  calendar date from observing schedule        
< ODATEOBS= '                  '  /  previous DATE-OBS                            
< DTPROPID= '2014A-0496        '  /  observing proposal ID                        
< DTACQNAM= '/data_local/images/DTS/2014A-0496/DECam_00297828.fits.fz'  /  file na
---
> CHECKSUM= 'a6Maa6JZa6Jaa6JY'   / HDU checksum updated 2022-11-21T13:47:36       
> DATASUM = '0       '           / data unit checksum updated 2022-11-21T13:47:36 
> DTSITE  = 'ct      '           / observatory location                           
> DTTELESC= 'ct4m    '           / telescope identifier                           
> DTINSTRU= 'decam   '           / instrument identifier                          
> DTCALDAT= '2014-03-30'         / calendar date from observing schedule          
> ODATEOBS= '2014-03-31T05:12:19.091401' / previous DATE-OBS                      
> DTUTC   = '2014-03-31T05:14:43'  /  post exposure UTC epoch from DTS            
> DTOBSERV= 'NOAO    '           / scheduling institution                         
> DTPROPID= '2014A-0496'         / observing proposal ID                          
> DTPIAFFL= '                  '  /  PI affiliation                               
> DTTITLE = '                  '  /  title of observing proposal                  
> DTCOPYRI= 'AURA    '           / copyright holder of data                       
> DTACQUIS= 'pipeline4.ctio.noao.edu'  /  host name of data acquisition computer  
> DTACCOUN= 'sispi             '  /  observing account name                       
> DTACQNAM= '/data_local/images/DTS/2014A-0496/DECam_00297828.fits.fz' / file na  
> DTNSANAM= 'c4d_140331_051219_ori.fits.fz' / file name in NOAO Science Archive   
> DT_RTNAM= 'c4d_140331_051443_ori'  /  NSA root name                             
> DTQUEUE = 'decam             '  /  DTS queue (17555)                            
> DTSTATUS= 'done              '  /  data transport status                        
> SB_HOST = 'pipeline4.ctio.noao.edu'  /  iSTB client host                        
> SB_ACCOU= 'sispi             '  /  iSTB client user account                     
> SB_SITE = 'ct                '  /  iSTB host site                               
> SB_LOCAL= 'dec               '  /  locale of iSTB daemon                        
> SB_DIR1 = '20140330          '  /  level 1 directory in NSA DS                  
> SB_DIR2 = 'ct4m              '  /  level 2 directory in NSA DS                  
> SB_DIR3 = '2014A-0496        '  /  level 3 directory in NSA DS                  
> SB_RECNO=               165255  /  iSTB sequence number                         
> SB_ID   = 'dec165255         '  /  unique iSTB identifier                       
> SB_NAME = 'c4d_140331_051443_ori.fits'  /  name assigned by iSTB                
> SB_RTNAM= 'c4d_140331_051443_ori'  /  NSA root name                             
> RMCOUNT =                    0  /  remediation counter                          
> RECNO   =               165255  /  NOAO Science Archive sequence number         
> COMMENT MODIFIED:DATE-OBS,DTACQNAM,DTCALDAT,DTCOPYRI,DTINSTRU,DTNSANAM,DTOBSERV,
> COMMENT DTPI,DTPIAFFL,DTPROPID,DTSITE,DTTELESC,DTTITLE,INSTRUME,OBSERVAT,OBSID,O
> COMMENT BSTYPE,ODATEOBS,PROCTYPE,PRODTYPE,PROPID,SIMPLE,TELESCOP,TIME-OBS       
> HISTORY Applied DTCALDATfromDATEOBSchile which added/modified fields (set()). Ol
> HISTORY d values were:

Looking at the metadata translator, it seems like the presence of the DTUTC header in the new vs old file changes how the exposure duration is defined.
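As a rough check of that reading, the two timespan end values in the warning can be reproduced with standard-library arithmetic. This sketch assumes TAI minus UTC was 35 s in March 2014, takes DATE-OBS and DTUTC from the diff above, and copies the two end times from the log message. It does not call the translator itself, and the helper utc_to_tai is a name made up for this example.

```python
from datetime import datetime, timedelta

# Assumption: TAI - UTC = 35 s at the time of the observation (March 2014).
TAI_MINUS_UTC = timedelta(seconds=35)


def utc_to_tai(iso_utc):
    """Convert a UTC ISO timestamp to TAI using the fixed offset above."""
    return datetime.fromisoformat(iso_utc) + TAI_MINUS_UTC


# Header values from the diff.
date_obs = "2014-03-31T05:12:19.091401"  # present in both files
dtutc = "2014-03-31T05:14:43"            # present only in the new file

# Timespan end values (TAI) copied from the ingest warning.
old_end = datetime.fromisoformat("2014-03-31 05:14:24.091401")
new_end = datetime.fromisoformat("2014-03-31 05:15:18.000000")

begin_tai = utc_to_tai(date_obs)
end_from_dtutc = utc_to_tai(dtutc)

old_duration = old_end - begin_tai
new_duration = new_end - begin_tai

print("begin (TAI):        ", begin_tai)
print("old duration (s):   ", old_duration.total_seconds())
print("new duration (s):   ", new_duration.total_seconds())
print("DTUTC as TAI == new end:", end_from_dtutc == new_end)
```

Under these assumptions the old record ends exactly 90 s after the start, while the new end time coincides with DTUTC converted to TAI, which is consistent with DTUTC being used as the end of the exposure when it is present.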

stevenstetzler · November 22, 2022, 12:17am

And for future readers: since I had already run butler define-visits, I additionally had to run the code from defineVisits.py, adding update_records=True:

task.run(
    butler.registry.queryDataIds(
        ["exposure"],
        dataId={"instrument": instr.getName()},
        collections=collections,
        datasets=raw_name,
        where=where,
    ),
    collections=collections,
    update_records=True,
)