Migrating PosgreSQL repositories to UUIDs

This is a brief guide for migrating PosgreSQL repositories to UUID dataset IDs based on experience obtained when we migrated few repositories at NCSA (see this post).

Migration from integer IDs to UUIDs is not strictly necessary but it is advisable to switch repository to UUID if you plan to import contents of other repositories. Depending on size of the repository it can take significant time to migrate, for example migration of largest NCSA repository (dc2 with ~30 million records) took more than three hours. Migration process generally requires exclusive access to database, it is safest to ask DBAs to limit access to a database to just a single user account which is going to perform this migration (doing ALTER ROLE user NOLOGIN for all other accounts).

This particular migration executes a series of ALTER TABLE ... statements which in PostgreSQL require user account executing them to be the owner of the modified table (or PostgreSQL superuser). Butler registry contains multiple tables created on demand by users adding new data which leads to a mixed ownership. For successful migration all tables that undergo migration need to be transferred to an account doing migration. The list of tables that are updated by this migration includes:

  • dataset, dataset_location, dataset_location_trash, file_datastore_records
  • all table names starting with dataset_calibs_
  • all table names starting with dataset_tags_

The command to change ownership of a table is ALTER TABLE table_name OWNER TO new_role. Alternatively the role that is executing migration commands can be given superuser privileges with command ALTER ROLE role SUPERUSER , all of the ALTER commands obviously need to be executed by DBA or any other superuser role.

Actual migration is performed by a tool which is a part of the daf_butler_migrate package which is not a part of LSST stack so it needs to be cloned separately:

git clone git@github.com:lsst-dm/daf_butler_migrate.git
cd daf_butler_migrate
source some-path/loadLSST.bash
setup -r .

The command that performs all migration-related operations is butler migrate. To check current status of the repository versions as recorded in the registry you can run this command, repo is the location of the butler.yaml:

$ repo=/path/to/butler.yaml
$ butler --log-level=warning migrate show-current $repo --butler
attributes: lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager 1.0.0 -> f22a777cf382
collections: lsst.daf.butler.registry.collections.nameKey.NameKeyCollectionManager 2.0.0 -> 93341d68b814
datasets: lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManager 1.0.0 -> 635083325f20
datastores: lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager 0.2.0 -> a07b3b60e369
dimensions: lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager 6.0.0 -> 87a30df8c8c5
opaque: lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager 0.2.0 -> 77e5b803ad3f

The scope of this UUID migration is datasets manager, and its name and version has to be as shown above ByDimensionsDatasetRecordStorageManager 1.0.0. Other managers may have different versions from shown above, this should not matter for UUID migration.

Migration tool is based on Alembic which requires presence of one special table in the database. If this is the first ever migration performed on a database then this special table needs to be created by stamp command:

$ butler --log-level=warning migrate stamp $repo

After this command one can check revision numbers in alembic table:

$ butler --log-level=warning migrate show-current $repo
a07b3b60e369 (head)
77e5b803ad3f (head)
93341d68b814 (head)
635083325f20 (head)
f22a777cf382 (head)
87a30df8c8c5 (head)

(the output will match revisions from previous command with --butler option).

This UUID migration is a special what is called “one-shot” migration which changes type of the datasets manager from ByDimensionsDatasetRecordStorageManager 1.0.0 to ByDimensionsDatasetRecordStorageManagerUUID 1.0.0, revision number for latter is 2101fbf51ad3. The command that performs migration is (redirect its output to a log file in case something goes wrong):

$ butler --log-level=debug --long-log migrate upgrade --one-shot-tree datasets/int_1.0.0_to_uuid_1.0.0 $repo 2101fbf51ad3 |& tee migration.log

The migration can print some warning messages, please ignore. This command can take significant time if the database is large. It is executed in a single database transaction, if anything fails during the migration then the state is reverted to original.

After this command completes one can check the new version numbers:

$ butler --log-level=warning migrate show-current $repo --butler
attributes: lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager 1.0.0 -> f22a777cf382
collections: lsst.daf.butler.registry.collections.nameKey.NameKeyCollectionManager 2.0.0 -> 93341d68b814
datasets: lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManagerUUID 1.0.0 -> 2101fbf51ad3
datastores: lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager 0.2.0 -> a07b3b60e369
dimensions: lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager 6.0.0 -> 87a30df8c8c5
opaque: lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager 0.2.0 -> 77e5b803ad3f

which shows that datasets manager is recorded as ByDimensionsDatasetRecordStorageManagerUUID 1.0.0 in the database. This change also needs to be reflected in butler.yaml file. Edit the file and replace ByDimensionsDatasetRecordStorageManager with ByDimensionsDatasetRecordStorageManagerUUID:

$ vi $repo
:1,$s/ByDimensionsDatasetRecordStorageManager/ByDimensionsDatasetRecordStorageManagerUUID/
:wq

To verify that everything works after migration you can run regular butler commands, e.g.:

$ butler query-datasets $repo --collections <collection> <dataset-type>

I have just merged a new subcommand butler migrate rewrite-sqlite-registry which will create an entirely new copy of the SQLite registry of a butler repository. This will effectively switch the registry from integers to UUIDs. The original files are backed up in case something goes wrong.

To use this script you will need master of daf_butler or wait for weekly 28 to be published.