Schema evolution: use registry, or use well-known place?

Reading the Rubin alert packets requires a schema that is not part of the packet itself but must be found elsewhere. The schema can be fetched automatically by the reading code through a “schema registry” (SR option), or fetched by a human from somewhere such as a git repo and copied into the right place in the consumer code (GIT option). It is the nature of schemas that they occasionally change, and all parties should be given adequate notice when such changes happen.
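For context, packets on a Confluent-style Kafka stream are normally framed with a 5-byte header (a magic byte plus a 4-byte schema ID) followed by the schemaless Avro body; the ID is what lets the SR option locate the right schema automatically, while the GIT option must know which schema applies by other means. A minimal sketch of unpacking that framing, assuming the standard Confluent wire format:

```python
import struct

def split_confluent_frame(message_bytes: bytes):
    """Split a Confluent-framed Avro message into (schema_id, avro_payload).

    Assumes the common wire format: 1 magic byte (0) + 4-byte big-endian
    schema ID + schemaless Avro body.
    """
    magic, schema_id = struct.unpack(">bI", message_bytes[:5])
    if magic != 0:
        raise ValueError("not a Confluent-framed message")
    return schema_id, message_bytes[5:]
```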

SR means that the registry is a critical resource: if millions of alerts are cached by a consumer, they cannot be read unless the registry is up and running. I note that the current registry at slac.stanford.edu has been down for a week (since July 11), so brokers have been unable to do any software development that depends on it during that time. I also note that if such software runs automated unit tests against the registry, none of those tests can run while it is down. Given the recent unreliability of the SR, any SR should be mirrored in at least several places.

A consistency problem arises if the schema is kept in more than one place: the consumer has to worry that the versions differ. In particular, I note that the GIT version of the 7.1 alert schema has doc strings, but the schema served by the SR does not. Notice that providing SR mirrors only adds to the worries about inconsistency.
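One way to keep that worry manageable is to compare the copies mechanically, e.g. fetch the registry's version and diff it against the git version after stripping the doc fields. A rough sketch (the registry URL, subject name, and file path are placeholders, and it assumes a single fully resolved .avsc file on the git side):

```python
import json
from confluent_kafka.schema_registry import SchemaRegistryClient

def strip_docs(node):
    """Recursively drop 'doc' entries so only structural differences remain."""
    if isinstance(node, dict):
        return {k: strip_docs(v) for k, v in node.items() if k != "doc"}
    if isinstance(node, list):
        return [strip_docs(v) for v in node]
    return node

client = SchemaRegistryClient({"url": "https://example-schema-registry.slac.stanford.edu"})
registry_schema = json.loads(client.get_latest_version("alert-packet").schema.schema_str)

with open("lsst.v7_1.alert.avsc") as f:  # local copy of the schema from git
    git_schema = json.load(f)

print("structurally identical:", strip_docs(registry_schema) == strip_docs(git_schema))
```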

One of the advantages of the SR is that changes can be slipped in and the reading software will magically parse the new packets correctly, without any human in the loop. However, subsequent processing may well fail in spite of this. For example, if the attribute filterName is changed to band in the schema, everything will parse fine until downstream, when the packet is inserted into a database table or used to make a plot. Whether we use SR or GIT, we humans must be properly informed of upcoming changes, which seems rather to preclude a main advantage of the SR.
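A toy illustration of where that failure shows up: the SR-aware deserializer happily returns the renamed record, and nothing breaks until downstream code asks for the old attribute (values here are made up):

```python
# Packet as returned by an SR-aware deserializer after the schema change:
packet = {"alertId": 12345, "band": "r"}  # 'filterName' has been renamed to 'band'

# Downstream code still written against the old schema:
def insert_into_table(p):
    return (p["alertId"], p["filterName"])

insert_into_table(packet)  # raises KeyError: 'filterName'
```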

For myself, I would be happy to forget about confluent_kafka.DeserializingConsumer and go back to fastavro.schemaless_reader.
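For concreteness, the reading loop I have in mind is roughly the following, with the schema pinned to a local file taken from git; the broker address, topic, and schema path are placeholders:

```python
import io
from confluent_kafka import Consumer
from fastavro import schemaless_reader
from fastavro.schema import load_schema

# Schema pinned to a local copy from a git checkout (path is a placeholder).
schema = load_schema("schema/7/1/lsst.v7_1.alert.avsc")

consumer = Consumer({"bootstrap.servers": "localhost:9092",  # placeholder broker
                     "group.id": "my-broker",
                     "auto.offset.reset": "earliest"})
consumer.subscribe(["alerts"])  # placeholder topic

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    payload = msg.value()[5:]  # drop the Confluent magic byte + 4-byte schema ID
    record = schemaless_reader(io.BytesIO(payload), schema)
    print(record["alertId"])
```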

If only an organization were created to handle international astronomical software standards and infrastructure.

Hi @roy,

Apologies for the downtime of the Schema Registry; it was a casualty of the USDF power outage last week and uncovered a couple of failure modes our team hadn’t seen before. We’re continuing to improve our operational procedures as we approach the start of live alerts. I agree in particular that the AP team will need to provide more detailed advance notice of schema changes.

Fundamentally, the alert schemas used by the Science Pipelines are held under change control in https://github.com/lsst/alert_packet; these in turn derive fairly directly from the AP science data model schemas (“sdm_schemas”), which can be browsed here. We have made the alert schemas pip-installable as lsst-alert-packet; this hasn’t been kept up to date but will be going forward.
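For reference, once the package is installed the bundled schemas can be loaded directly; a minimal usage sketch, assuming the lsst.alert.packet API names shown here (Schema.from_file, Schema.deserialize) and a placeholder input file, which are worth checking against the repository:

```python
# Sketch only: API names are assumptions and should be checked against alert_packet.
from lsst.alert.packet import Schema

schema = Schema.from_file()           # defaults to the latest schema bundled with the package
with open("alert.avro", "rb") as f:   # placeholder file containing one serialized alert
    record = schema.deserialize(f.read())
```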

The Confluent Schema Registry currently deployed as part of our alert distribution system provides some niceties but also its own challenges: as you indicate, it must be kept synchronized with the actual schemas in use, and it is a service which must be kept running. Additionally, we have recognized a technical challenge with creating and maintaining a consistent mapping of identifiers between the git-managed alert schemas and those in the Schema Registry. We must resolve this in the coming months and will provide further guidance once we have identified a path forward.
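For comparison, the SR-based consumer path currently looks roughly like this; the registry URL, broker address, topic, and group ID below are placeholders:

```python
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

# Placeholder registry and broker addresses.
sr_client = SchemaRegistryClient({"url": "https://example-schema-registry.slac.stanford.edu"})

consumer = DeserializingConsumer({
    "bootstrap.servers": "example-alert-stream:9092",
    "group.id": "my-broker",
    "auto.offset.reset": "earliest",
    # With no reader schema given, the deserializer fetches the writer schema
    # from the registry using the ID embedded in each message.
    "value.deserializer": AvroDeserializer(sr_client),
})
consumer.subscribe(["alerts"])

msg = consumer.poll(10)
if msg is not None and msg.error() is None:
    record = msg.value()  # already deserialized into a Python dict
```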

Thank you,
Eric