Potential bug/race condition with butler: duplicate key value violates unique constraint "pg_type_typname_nsp_index"

Hi all,

I encountered an error when running parallel butler ingest-raws and butler define-visits commands against disjoint collections of a single butler repository, using one PostgreSQL instance as the registry. As an example, I am running:

# node 1
$ butler ingest-raws {repo} {files} --output-run DECam/raw/science/210318 -j 28
# node 2
$ butler ingest-raws {repo} {files} --output-run DECam/raw/science/210321 -j 28
# node 3
$ butler ingest-raws {repo} {files} --output-run DECam/raw/science/210324 -j 28

and encountering errors like:

# node 1
lsst.ingest INFO: Successfully extracted metadata from 33 files with 0 failures
lsst.daf.butler.cli.utils ERROR: Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(dataset_tags_00000004, 2200) already exists.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/cli/cmd/commands.py", line 138, in ingest_raws
    script.ingestRaws(*args, **kwargs)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/script/ingestRaws.py", line 85, in ingestRaws
    ingester.run(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/utils/g63a1f4f1ec+494d45cfd3/python/lsst/utils/timer.py", line 339, in timeMethod_wrapper
    res = func(self, *args, **keyArgs)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/ingest.py", line 1183, in run
    new_refs, bad, n_exp, n_exp_fail, n_ingest_fail = self.ingestFiles(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/ingest.py", line 1013, in ingestFiles
    self.butler.registry.registerDatasetType(self.datasetType)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/daf_butler/ge8d8315da9+47328fee21/python/lsst/daf/butler/registries/sql.py", line 381, in registerDatasetType
    _, inserted = self._managers.datasets.register(datasetType)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/daf_butler/ge8d8315da9+47328fee21/python/lsst/daf/butler/registry/datasets/byDimensions/_manager.py", line 282, in register
    tags = self._db.ensureTableExists(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/daf_butler/ge8d8315da9+47328fee21/python/lsst/daf/butler/registry/interfaces/_database.py", line 1020, in ensureTableExists
    table.create(connection)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/sql/schema.py", line 950, in create
    bind._run_ddl_visitor(ddl.SchemaGenerator, self, checkfirst=checkfirst)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2113, in _run_ddl_visitor
    visitorcallable(self.dialect, self, **kwargs).traverse_single(element)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/sql/visitors.py", line 524, in traverse_single
    return meth(obj, **kw)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/sql/ddl.py", line 893, in visit_table
    self.connection.execute(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/sql/ddl.py", line 80, in _execute_on_connection
    return connection._execute_ddl(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1381, in _execute_ddl
    ret = self._execute_context(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(dataset_tags_00000004, 2200) already exists.

[SQL: 
CREATE TABLE public.dataset_tags_00000004 (
	dataset_type_id BIGINT NOT NULL, 
	dataset_id UUID, 
	collection_id BIGINT, 
	instrument VARCHAR(16) NOT NULL, 
	detector BIGINT NOT NULL, 
	exposure BIGINT NOT NULL, 
	PRIMARY KEY (dataset_id, collection_id), 
	CONSTRAINT dataset_tags_00000004_unq_dataset_type_id_collection_i_0ba9af29 UNIQUE (dataset_type_id, collection_id, instrument, detector, exposure), 
	CONSTRAINT fkey_dataset_tags_00000004_dataset_type_id_dataset_type_id FOREIGN KEY(dataset_type_id) REFERENCES public.dataset_type (id), 
	CONSTRAINT fkey_dataset_tags_00000004_dataset_id_dataset_id FOREIGN KEY(dataset_id) REFERENCES public.dataset (id) ON DELETE CASCADE, 
	CONSTRAINT fkey_dataset_tags_00000004_collection_collection_id_co_a5ba3b28 FOREIGN KEY(collection_id) REFERENCES public.collection (collection_id) ON DELETE CASCADE, 
	CONSTRAINT fkey_dataset_tags_00000004_collection_summary_dataset__b377058f FOREIGN KEY(collection_id, dataset_type_id) REFERENCES public.collection_summary_dataset_type (collection_id, dataset_type_id), 
	CONSTRAINT fkey_dataset_tags_00000004_instrument_name_instrument FOREIGN KEY(instrument) REFERENCES public.instrument (name), 
	CONSTRAINT fkey_dataset_tags_00000004_collection_summary_instrume_5947d29e FOREIGN KEY(collection_id, instrument) REFERENCES public.collection_summary_instrument (collection_id, instrument), 
	CONSTRAINT fkey_dataset_tags_00000004_detector_instrument_id_inst_1d3d0e23 FOREIGN KEY(instrument, detector) REFERENCES public.detector (instrument, id), 
	CONSTRAINT fkey_dataset_tags_00000004_exposure_instrument_id_inst_0237f157 FOREIGN KEY(instrument, exposure) REFERENCES public.exposure (instrument, id)
)

]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

The same error appears on nodes 1-3, and the PostgreSQL server logs simultaneous errors:

# node running postgresql
2022-02-09 20:22:03.319 PST [19028] ERROR:  duplicate key value violates unique constraint "pg_type_typname_nsp_index"
2022-02-09 20:22:03.319 PST [19028] DETAIL:  Key (typname, typnamespace)=(dataset_tags_00000004, 2200) already exists.
2022-02-09 20:22:03.319 PST [19028] STATEMENT:  
	CREATE TABLE public.dataset_tags_00000004 (
		dataset_type_id BIGINT NOT NULL, 
		dataset_id UUID, 
		collection_id BIGINT, 
		instrument VARCHAR(16) NOT NULL, 
		detector BIGINT NOT NULL, 
		exposure BIGINT NOT NULL, 
		PRIMARY KEY (dataset_id, collection_id), 
		CONSTRAINT dataset_tags_00000004_unq_dataset_type_id_collection_i_0ba9af29 UNIQUE (dataset_type_id, collection_id, instrument, detector, exposure), 
		CONSTRAINT fkey_dataset_tags_00000004_dataset_type_id_dataset_type_id FOREIGN KEY(dataset_type_id) REFERENCES public.dataset_type (id), 
		CONSTRAINT fkey_dataset_tags_00000004_dataset_id_dataset_id FOREIGN KEY(dataset_id) REFERENCES public.dataset (id) ON DELETE CASCADE, 
		CONSTRAINT fkey_dataset_tags_00000004_collection_collection_id_co_a5ba3b28 FOREIGN KEY(collection_id) REFERENCES public.collection (collection_id) ON DELETE CASCADE, 
		CONSTRAINT fkey_dataset_tags_00000004_collection_summary_dataset__b377058f FOREIGN KEY(collection_id, dataset_type_id) REFERENCES public.collection_summary_dataset_type (collection_id, dataset_type_id), 
		CONSTRAINT fkey_dataset_tags_00000004_instrument_name_instrument FOREIGN KEY(instrument) REFERENCES public.instrument (name), 
		CONSTRAINT fkey_dataset_tags_00000004_collection_summary_instrume_5947d29e FOREIGN KEY(collection_id, instrument) REFERENCES public.collection_summary_instrument (collection_id, instrument), 
		CONSTRAINT fkey_dataset_tags_00000004_detector_instrument_id_inst_1d3d0e23 FOREIGN KEY(instrument, detector) REFERENCES public.detector (instrument, id), 
		CONSTRAINT fkey_dataset_tags_00000004_exposure_instrument_id_inst_0237f157 FOREIGN KEY(instrument, exposure) REFERENCES public.exposure (instrument, id)
	)

And here are the actual commands run and logs on each node:

node 1
$ butler ingest-raws /gscratch/scrubbed/stevengs/repo /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193147_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193241_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193335_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193429_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193523_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193618_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193712_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193806_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193900_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_193954_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_194048_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_194759_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_194835_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_194911_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_194948_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195024_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195100_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195137_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195213_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195249_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195325_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195401_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195437_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195542_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195646_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195750_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195854_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_195958_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_200103_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_200207_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_200311_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_200415_fri.fits.fz /gscratch/scrubbed/stevengs/decam_ddf/nights/210402/calibs/c4d_210402_200520_fri.fits.fz --transfer link --output-run DECam/raw/bias/210402 -j 28
(The traceback and SQL statement are identical to those quoted above; nodes 2 and 3 ran analogous commands and failed the same way.)

For define-visits, I see errors like the following (one example, with the actual command):

$ butler define-visits /gscratch/scrubbed/stevengs/repo lsst.obs.decam.DarkEnergyCamera --collections DECam/raw/science/210526/cosmos_1 -j 28
lsst.defineVisits INFO: Preprocessing data IDs.
lsst.defineVisits INFO: Registering visit_system 0: one-to-one.
lsst.defineVisits INFO: Grouping 5 exposure(s) into visits.
lsst.defineVisits INFO: Computing regions and other metadata for 5 visit(s).
numexpr.utils INFO: Note: NumExpr detected 28 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
numexpr.utils INFO: NumExpr defaulting to 8 threads.
lsst.ctrl.mpexec.cmdLineFwk INFO: QuantumGraph contains 310 quanta for 1 tasks, graph ID: '1644903530.1639528-7870'
lsst.daf.butler.cli.utils ERROR: Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line
 1802, in _execute_context
    self.dialect.do_execute(
  File "/gscratch/astro/stevengs/lsst_stacks/stacks/w.2022.06/env/lsst-w.2022.06/lib/python3.8/site-packages/sqlalchemy/engine/default.py", l
ine 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(dataset_tags_00000000, 2200) already exists.

I haven’t seen the error in any pipetask run commands yet, but I am keeping an eye out.

This error looks different from the one I get when ingesting duplicate raws into the same collection, which is where I first thought the error was coming from. For example:

$ butler ingest-raws {repo} {files} --transfer link --output-run DECam/raw/science/210318 -j 28
...
lsst.ingest WARNING: Failed to ingest the following for reason: A database constraint failure was triggered by inserting one or more datasets of type DatasetType('raw', {band, instrument, detector, physical_filter, exposure}, Exposure) into collection 'DECam/raw/science'. This probably means a dataset with the same data ID and dataset type already exists, but it may also mean a dimension row is missing.
lsst.ingest WARNING: - file:///epyc/data/decam_ddf/nights/210318/objects/cosmos_1/c4d_210319_022302_ori.fits.fz
lsst.ingest INFO: Successfully processed data from 0 exposures with 0 failures from exposure registration and 2 failures from file ingest.
lsst.ingest INFO: Ingested 0 distinct Butler datasets
lsst.daf.butler.cli.utils ERROR: Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/epyc/projects/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/cli/cmd/commands.py", line 138, in ingest_raws
    script.ingestRaws(*args, **kwargs)
  File "/epyc/projects/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/script/ingestRaws.py", line 85, in ingestRaws
    ingester.run(
  File "/epyc/projects/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/utils/g63a1f4f1ec+494d45cfd3/python/lsst/utils/timer.py", line 339, in timeMethod_wrapper
    res = func(self, *args, **keyArgs)
  File "/epyc/projects/lsst_stacks/stacks/w.2022.06/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/obs_base/gcaa7f91c06+32dde6e53f/python/lsst/obs/base/ingest.py", line 1227, in run
    raise RuntimeError("Some failures encountered during ingestion")
RuntimeError: Some failures encountered during ingestion

I am wondering whether this results from:

  1. An unaccounted-for race condition in the pipeline code
  2. A PostgreSQL bug
  3. A PostgreSQL configuration setting that could be changed to handle parallel writes (e.g. enable_parallel_hash? Or is there something special that ctrl_bps uses?)
  4. User error

I realize this use case might be out of scope for the pipelines, since I’m not using ctrl_bps to handle multi-node processing. Instead, I am using multiple nodes to process disjoint subsets of the data into separate butler collections.

Thanks,
Steven

Thank you for the report. We have filed this as DM-33684. When you first ingest a file, a table is created. It’s likely that if you ingest a single file first and then run your multi-process, multi-node ingest, things will work. I believe that @FabioHernandez has done parallelized ingest at IN2P3.

If you are doing ingests of the same dataset multiple times, you can make things somewhat faster by pre-calculating the metadata. Fabio has documented his findings in RTN-029.

I’m not really sure what’s going on with this one. There are only 5 visits, so parallelism is not really necessary. The duplicate key constraint is an interesting error. Do you get it without the -j option if you rerun?

This [the lsst.ctrl.mpexec.cmdLineFwk QuantumGraph message in the define-visits log] looks wrong to me. pipetask and quantum graphs should not be involved.

This [an unaccounted-for race condition in the pipeline code] is likely. Database table creation is notoriously hard to handle across parallel attempts, and we mostly don’t notice because the table is only created for the very first file being ingested.
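To make the failure mode concrete, here is a minimal standalone sketch of the check-then-create pattern that can race. This is not the daf_butler implementation, just the shape of it, and the connection string is a placeholder:

import sqlalchemy

# Placeholder DSN; any PostgreSQL database shows the same behavior.
engine = sqlalchemy.create_engine("postgresql+psycopg2://user@host/repo")
metadata = sqlalchemy.MetaData()
table = sqlalchemy.Table(
    "dataset_tags_00000004",
    metadata,
    sqlalchemy.Column("dataset_type_id", sqlalchemy.BigInteger, nullable=False),
)

with engine.begin() as connection:
    # Two processes can both observe the table as missing here...
    if not sqlalchemy.inspect(connection).has_table(table.name):
        # ...and both issue CREATE TABLE; the loser fails with the
        # pg_type_typname_nsp_index unique violation seen in the logs.
        table.create(connection)

A common mitigation for this pattern on PostgreSQL is to serialize the DDL (for example with an advisory lock) or to catch the unique violation and re-check whether the table now exists; I can’t say yet which approach DM-33684 will take.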

ctrl_bps only writes to the registry once the entire workflow completes. We are not getting reports of problems with these merge jobs running in parallel during DP0.2 reprocessing.

Thanks Tim. I’m glad I posted, then, if this is a bug needing resolution.

I will have to do testing and a re-run to see about -j with define-visits. I think the bug definitely did occur when running define-visits on multiple nodes (and not ingest-raws). In my tests, I have stdout and stderr going to different files, so perhaps I’ve mixed up which commands are associated with which logs.

We have come to the conclusion that running define-visits in parallel is a bad idea. It’s entirely an in-database operation so running it in one place single-threaded should be fine and shouldn’t take very long.


That’s good to know. It shouldn’t be a problem for me to define visits on a single node with -j 1 after I ingest the raws. I had already switched from multi-node raw ingestion to single-node ingestion with -j 28 because I saw these errors there too.

As you can see from the IN2P3 document, running ingest-raws in parallel is a reasonable thing to do. We think your specific problem in this case is that you had not yet ingested anything and so the table had to be created by each of the jobs. If you had ingested one file and then done the multi-node ingest then that should work fine.
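If you would rather do that warm-up step from Python than by ingesting a sacrificial file, a sketch along these lines should also work, since (as your traceback shows) it is registerDatasetType that triggers creation of the shared table. The repo path is the one from your logs, and the dataset type matches the one in your ingest warning; treat this as untested:

from lsst.daf.butler import Butler, DatasetType

# Run once, in a single process, before fanning out across nodes.
butler = Butler("/gscratch/scrubbed/stevengs/repo", writeable=True)
raw = DatasetType(
    "raw",
    {"band", "instrument", "detector", "physical_filter", "exposure"},
    "Exposure",
    universe=butler.registry.dimensions,
)
# The same call the parallel ingests race on; run alone, it creates
# the dataset_tags_* table exactly once.
butler.registry.registerDatasetType(raw)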

Just a note to say that after reading through, I’ve marked the latest reply from @timj as the solution to this issue. If that’s not quite right or if it’s actually still open, please just say.

I just ran into this (or at least something very similar) as well, and was about to make a new forum thread about it before finding this existing one. I’m using v23_0_1 of the LSST pipelines. The following command seems to leave me in an unending (thus far, ~13 hours in) loop of sqlite3 error messages:

LOGFILE=$LOGDIR/define_visits.log; \
date | tee $LOGFILE; \
butler define-visits $REPO lsst.obs.decam.DarkEnergyCamera \
-j 20 \
2>&1 | tee -a $LOGFILE; \
date | tee -a $LOGFILE

The repeating error messages in my log file look like:

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Process ForkPoolWorker-20:
Traceback (most recent call last):
  File "/data0/lsst_stack_v23_0_1/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.8.1/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/data0/lsst_stack_v23_0_1/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-0.8.1/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

I am running on a Linux server with 40 CPUs (no multi-node or distributed aspect to what I’m doing). I don’t think my Butler repo has any fancy or unusual aspects — it’s just 260 raw DECam exposures from a single observing night. I’ve attached the log file (first 20k lines), in case that’s of any interest. In any case, I’ll take Tim’s advice from earlier in this thread and not run define-visits in parallel from here on out.

define_visits.log (1.8 MB)

Yeah, especially with SQLite, doing parallel things — especially 20-wide — may be asking for trouble.
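The underlying issue is that SQLite permits only one writer at a time, so 20 forked workers writing to the same registry file will serialize at best and time out with “database is locked” at worst. A minimal standalone demonstration, independent of butler:

import os
import sqlite3
import tempfile

# Two connections to one database file; isolation_level=None means we
# manage transactions explicitly, and the short timeout makes the lock
# failure show up quickly instead of after the default 5 seconds.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
a = sqlite3.connect(path, timeout=0.1, isolation_level=None)
b = sqlite3.connect(path, timeout=0.1, isolation_level=None)

a.execute("CREATE TABLE visits (id INTEGER)")
a.execute("BEGIN IMMEDIATE")       # connection a takes the write lock
a.execute("INSERT INTO visits VALUES (1)")
try:
    b.execute("BEGIN IMMEDIATE")   # connection b cannot acquire it...
except sqlite3.OperationalError as exc:
    print(exc)                     # ...and raises "database is locked"
a.commit()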

Defining visits should normally be quite fast; I’m not sure why you’d want to do it in parallel. On top of that, defining visits usually wants to group together dataIds, so you really shouldn’t be splitting it up anyway.


Thanks! Defining visits with SQLite in serial (no -j argument) took about 1 minute for the set of 260 DECam exposures I’m currently testing on. I eventually want to process tens of thousands of DECam exposures with the LSST pipelines, so I’m trying out -j wherever it’s an accepted argument to see how it might be of use as I scale up.

If you want to get to tens of thousands, I’m guessing you’ll want to break it up into smaller pieces (because it might scale as N log N) by adding a data query (e.g. by observation day/night). But those smaller pieces should still almost certainly be executed in series with SQLite, not in parallel.
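I don’t know off-hand whether define-visits takes a data-query option in your version, but the kind of per-night split I mean can be worked out through the registry API. A hedged sketch (the repo path is hypothetical, and this assumes exposure records carry day_obs, as they do for DECam):

from collections import defaultdict

from lsst.daf.butler import Butler

# Group DECam exposures by observing night so each night can be
# processed as its own smaller piece (in series, with SQLite).
butler = Butler("/path/to/repo")  # hypothetical repo path
by_night = defaultdict(list)
for rec in butler.registry.queryDimensionRecords(
    "exposure", where="instrument = 'DECam'"
):
    by_night[rec.day_obs].append(rec.id)

for night, exposures in sorted(by_night.items()):
    print(night, len(exposures))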
