More support for configurable PipelineTask connections

jbosch · May 12, 2023, 1:19am

After some misadventures involving git revert, I’ve just merged DM-38953, which provides a lot more support for PipelineTask connections that depend on configuration for more than just the dataset type name. This includes changing task dimensions, dataset type dimensions, storage classes, connection types (e.g. you can change an Input into a PrerequisiteInput), and creating completely new connections in __init__.

It all works through regular attribute assignment and deletion syntax:

from lsst.pipe.base import PipelineTaskConnections
from lsst.pipe.base.connectionTypes import Input, Output, PrerequisiteInput


class SomeConnections(PipelineTaskConnections, dimensions=()):
    a = Input(...)
    b = Input(...)

    def __init__(self, *, config):
        # Remove an existing connection.
        del self.a
        # Replace an existing connection.
        self.b = PrerequisiteInput(...)
        # Add a brand new connection.
        self.c = Output(...)
        # Change the task dimensions.
        self.dimensions.update({"patch", "band"})

Some additional notes:

Delegating to super().__init__ is harmless, but it now does nothing - the first step of initialization happens in the metaclass.
Removing a connection via e.g. self.inputs.remove("a") still works, and we have no plans to drop support for it, but we prefer the del self.a approach as more intuitive and readable. We don’t currently see any reason for new code to interact with the inputs, outputs, initInputs, etc. sets at all, but they’re all still there for backwards compatibility.
The dimensions attribute on connections objects is a set that may be modified in-place or replaced with another set-like object. After __init__ it will be turned into a frozenset (as are the inputs, outputs, etc. sets).
The only breaking changes were to connections classes that were assigning to the self.allConnections mapping in __init__ (which was never supported, but it was the only hack that worked for some problems before). This is now a read-only mapping view that is updated automatically when connection attributes are added, removed, or replaced.
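
The freezing and read-only-view behavior described in the last two notes can be illustrated with plain Python built-ins. This is only an analogy using frozenset and types.MappingProxyType; it makes no claim about how lsst.pipe.base implements these attributes, and every name below is made up for illustration.

```python
from types import MappingProxyType

# Hypothetical stand-ins; this is NOT the lsst.pipe.base implementation.
_connections = {"a": "Input(...)", "b": "Input(...)"}
all_connections = MappingProxyType(_connections)  # read-only view of the dict

# Changes to the underlying dict show up in the view automatically.
del _connections["a"]
_connections["c"] = "Output(...)"
visible_names = sorted(all_connections)

# Assigning through the view itself is rejected.
try:
    all_connections["d"] = "Output(...)"
    view_is_writable = True
except TypeError:
    view_is_writable = False

# A plain set can be edited freely until it is frozen.
dimensions = {"tract"}
dimensions.update({"patch", "band"})
frozen_dimensions = frozenset(dimensions)
can_add_after_freeze = hasattr(frozen_dimensions, "add")
```

The point of the analogy is that the view tracks attribute-style changes made elsewhere, while direct assignment to it fails, which is why code that used to write into allConnections had to change.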

kfindeisen · May 12, 2023, 7:18pm

Are there any shortcuts if you want to change one property of a connection, or do you have to create a new connection object and replace with it like in the examples above?

jbosch · May 12, 2023, 7:20pm

Yes, you do have to wholly replace them, since all connection objects are declared as dataclasses with frozen=True. But since they are dataclasses, the dataclasses.replace function is super convenient for this.
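
As a rough sketch of that pattern, here is dataclasses.replace applied to a frozen dataclass. FakeInput and its fields are a hypothetical stand-in so the snippet runs without the LSST stack; a real connection class has its own set of fields.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class FakeInput:
    """Hypothetical stand-in for a frozen connection dataclass."""

    name: str
    storageClass: str
    dimensions: tuple = ()
    deferLoad: bool = False


original = FakeInput(
    name="calexp",
    storageClass="ExposureF",
    dimensions=("visit", "detector"),
)

# Build a copy that differs in exactly one field; the original is untouched.
modified = dataclasses.replace(original, deferLoad=True)

# Direct mutation is rejected because the dataclass is frozen.
try:
    original.deferLoad = True
    mutation_allowed = True
except dataclasses.FrozenInstanceError:
    mutation_allowed = False
```

Inside a connections class __init__, the same idea would presumably read something like self.b = dataclasses.replace(self.b, deferLoad=True), assuming self.b returns the current connection object.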

arunkannawadi · May 24, 2023, 3:42pm

Is the config here an instance of a ConfigClass used for a PipelineTask that uses SomeConnections?

jbosch · May 25, 2023, 2:02pm

Yes, exactly that.

arunkannawadi · November 20, 2023, 3:57pm

I would like to create a new dataset where part of the name comes from config and part of the name is specified in defaultTemplates. Is it possible to create new connections in __init__ with names where prefixes are specified in defaultTemplates and if so, how?

jbosch · November 20, 2023, 4:13pm

Each entry in defaultTemplates causes a field to be added to config.connections, so it should work to just look there in your connections class’s __init__ implementation (which is passed the config) to do this.
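
A minimal sketch of that lookup, using SimpleNamespace objects in place of a real config so it runs without the LSST stack. The names coaddName (standing in for a defaultTemplates key), suffix (standing in for an ordinary config field), and build_dataset_name are all assumptions made for illustration.

```python
from types import SimpleNamespace


def build_dataset_name(config):
    """Combine a template value and a config field into a dataset type name."""
    # Assumed: "coaddName" is a defaultTemplates key, so it appears as a
    # field on config.connections.
    prefix = config.connections.coaddName
    # Assumed: "suffix" is a regular field defined on the task config.
    return f"{prefix}Coadd_{config.suffix}"


# Hypothetical config object mimicking the attribute layout described above.
config = SimpleNamespace(
    connections=SimpleNamespace(coaddName="deep"),
    suffix="diagnostics",
)
dataset_name = build_dataset_name(config)
```

In a real connections class the resulting string would then be passed as the name of the new connection created in __init__, alongside whatever dimensions and storage class it needs.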