What does the data rights policy mean for AI models trained on proprietary data? Would these models need to be restricted to data rights holders?
In talking with folks about this, there seemed to be a distinction: AI models that simply use proprietary data to create derived products (e.g., images → redshifts) would be fine, but models that can reconstruct images would not be. I want to confirm this.
Specifically: would releasing a model that generates LSST-like images, pre-trained on proprietary LSST images, be allowed? This is probably the most on-the-nose case.
As detailed in the Rubin Data Policy (RDO-013), proprietary data may not be published for two years after they are released to the data rights holder community. Derived data products (DDPs) generated from the proprietary data may, however, be published and shared with non data rights holders. The chief distinction of a DDP is that the proprietary data cannot be regenerated from it. Assuming an AI model that produces an LSST-like image cannot be used to regenerate the original LSST image identically, then such a deep fake (apologies) could be shared. A key point is that generated images must be identifiable as such: any image generated and shared should be clearly labeled as produced by an AI or other model. I expect that we will review this aspect of the data policy as more LSST data make their way into training sets.
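One possible way to satisfy the "clearly identifiable" expectation would be to embed a provenance flag directly in the image metadata. Below is a minimal, hypothetical sketch using `astropy.io.fits`; the keyword name (`SYNTHIMG`) and the model-description comment are illustrative assumptions, not anything prescribed by RDO-013.

```python
import numpy as np
from astropy.io import fits

# Hypothetical output of a generative model: a small synthetic image array.
synthetic_image = np.random.normal(loc=100.0, scale=5.0, size=(256, 256)).astype("float32")

# Wrap the array in a FITS HDU and record its synthetic provenance in the header.
hdu = fits.PrimaryHDU(data=synthetic_image)
hdu.header["SYNTHIMG"] = (True, "Image is synthetic, produced by a generative model")
hdu.header["ORIGIN"] = ("AI model (illustrative)", "Not an observed LSST exposure")
hdu.header.add_comment("Generated image; not derived by reversible reconstruction of proprietary pixels.")

# Write the labeled product to disk.
hdu.writeto("synthetic_lsst_like_image.fits", overwrite=True)
```

Downstream users (or archive ingest tooling) could then check for the provenance keyword before treating a file as observational data.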