Replacing a dataset entry in the butler database

We have been running pipetask with quantum-backed butlers and merging the resulting datasets into our butler after processing.
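For reference, the per-graph workflow looks roughly like this (repo and graph paths as in our setup; pipetask run-qbb is the quantum-backed execution step):

pipetask run-qbb BUTLER-20240709 qgraphs-BUTLER-20240709/processed/processCcd330994_53.qgraph
butler transfer-from-graph --update-output-chain qgraphs-BUTLER-20240709/processed/processCcd330994_53.qgraph BUTLER-20240709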

When transferring from the graph output into the butler db, we get the following error:

butler --log-level=ERROR --long-log transfer-from-graph --update-output-chain qgraphs-BUTLER-20240709/processed/processCcd330994_53.qgraph BUTLER-20240709 

lsst.daf.butler.registry._exceptions.ConflictingDefinitionError: Existing dataset type and dataId does not match new dataset: {'dataset_type_id': 33, 'instrument': 'HSC', 'detector': 53, 'exposure': 330994, 'dataset_id': UUID('cfb67d32-225b-43f3-8c16-2cd182988e77'), 'new dataset_id': UUID('ff3e8529-48eb-4611-8be4-e8edc4e7ed3f'), 'collection_id': 21, 'new collection_id': 21}

I suspect we have mistakenly processed the same dataset twice in the same RUN collection, clobbering the previously produced datasets, so the butler SQL database now holds the wrong data. This appears to have happened for a few hundred datasets.
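To confirm which UUID the registry currently holds for one of the affected data IDs, something like this should show it (the dataset type name below is a placeholder, since the error only reports its id, 33):

butler query-datasets BUTLER-20240709 <dataset-type> --collections <run-collection> --where "instrument='HSC' AND detector=53 AND exposure=330994"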

Is there a process to replace the db content with what is on disk?
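For example, would something along these lines be safe, i.e. purging the conflicting registry entries for the run and then re-running the transfer? (Just a sketch, we have not tried it; the dataset type and run names are placeholders.)

butler prune-datasets BUTLER-20240709 <dataset-type> --collections <run-collection> --purge <run-collection> --where "instrument='HSC' AND detector=53 AND exposure=330994"
butler transfer-from-graph --update-output-chain qgraphs-BUTLER-20240709/processed/processCcd330994_53.qgraph BUTLER-20240709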

Thanks,
JJ

What version of the software are you using? We changed that error message a year ago to make it clearer what the problem is.

That error message is from v26.0.0.0 … here is the full dump:

(lsst-scipipe-7.0.1) [HSC_2024] (559) $ butler --log-level=ERROR --long-log transfer-from-graph --update-output-chain qgraphs-BUTLER-20240709/processed/processCcd33089 BUTLER-20240709 
processCcd330892_95.qgraph  processCcd330898_90.qgraph  
(lsst-scipipe-7.0.1) [HSC_2024] (559) $ butler --log-level=ERROR --long-log transfer-from-graph --update-output-chain qgraphs-BUTLER-20240709/processed/processCcd330892_95.qgraph BUTLER-20240709 
ERROR 2024-09-04T19:35:45.717+00:00 lsst.daf.butler.cli.utils ()(utils.py:1127) - Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/pipe_base/g8798d61f7d+6612571a14/python/lsst/pipe/base/cli/cmd/commands.py", line 61, in transfer_from_graph
    number = script.transfer_from_graph(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/pipe_base/g8798d61f7d+6612571a14/python/lsst/pipe/base/script/transfer_from_graph.py", line 97, in transfer_from_graph
    transferred = dest_butler.transfer_from(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/daf_butler/gaa4f23791d+8ca47f5a75/python/lsst/daf/butler/_butler.py", line 2468, in transfer_from
    imported_refs = self._registry._importDatasets(refs_to_import, expand=False)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/daf_butler/gaa4f23791d+8ca47f5a75/python/lsst/daf/butler/core/utils.py", line 55, in inner
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/daf_butler/gaa4f23791d+8ca47f5a75/python/lsst/daf/butler/registries/sql.py", line 651, in _importDatasets
    refs = list(storage.import_(runRecord, expandedDatasets))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/daf_butler/gaa4f23791d+8ca47f5a75/python/lsst/daf/butler/registry/datasets/byDimensions/_storage.py", line 668, in import_
    self._validateImport(tmp_tags, run)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/daf_butler/gaa4f23791d+8ca47f5a75/python/lsst/daf/butler/registry/datasets/byDimensions/_storage.py", line 812, in _validateImport
    raise ConflictingDefinitionError(
lsst.daf.butler.registry._exceptions.ConflictingDefinitionError: Existing dataset type and dataId does not match new dataset: {'dataset_type_id': 24, 'instrument': 'HSC', 'detector': 95, 'visit': 330892, 'dataset_id': UUID('df18597d-afb7-4b00-b4fd-aabf25a32620'), 'new dataset_id': UUID('c24460c0-2ec2-489a-a60f-87dfb7e05241'), 'collection_id': 21, 'new collection_id': 21}

The error message now explains:

You would get an error like this if you regenerated the graph but used the same RUN collection. The UUIDs would change, so the system could no longer tell that the dataset was already present. If you reuse a RUN, you now have to use the same graph. If you regenerate the graph, it has to be a new RUN. pipetask update-graph-run can be used to update the run and UUIDs in an existing graph.
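For example, something like this (the argument order is input graph, new run name, output graph; the run and output names here are placeholders):

pipetask update-graph-run qgraphs-BUTLER-20240709/processed/processCcd330994_53.qgraph <new-run-collection> processCcd330994_53-rerun.qgraph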

v26 is old enough that I am not really sure what situation you are in.