How can we have data packages that depend on `obs_*` not get re-installed?

The validation_data_* packages each depend on their respective obs_* package.

E.g.,
validation_data_decam depends on obs_decam
validation_data_cfht on obs_cfht
validation_data_hsc on obs_subaru

This dependency makes logical sense because one can’t even read the data repository without a camera mapper file. However, the implementation in our lsstsw build system (and potentially eups?) installs a new copy of the validation_data_* package whenever ?something? determines that the dependencies have changed enough to require a new package. This isn’t what we want.

Presently people tend to just manually remove the redundant installed versions. But this is actually a problem, because it breaks any later attempt to set up a specific build that used the previously installed copy. We also have the case that these validation_data_* packages will legitimately change; that’s why we’re suffering through the pain of version-controlling them in the first place.

What would be a good solution?

  1. Create a new setupDataset or similar category in the table file?
  2. Remove the listed dependency in the validation_data_* table files and just figure that the user knows to set up the obs_* package.

The way it works with afwdata is to make it an optional dependency on the thing that uses it. This lets it be masked out of binary builds but does allow people who really want the data locally to build it fine. The testdata then only gets rebuilt as needed. I don’t know whether people think that making the obs_ packages depend on their test data is a good thing in general as it’s obviously not scalable as more and more data gets included.

Most devs build with lsstsw which installs all optional dependencies. That would be crushing for the validation_data_* datasets. The number of people who want obs_subaru is many times greater than the number who want to download validation_data_hsc.

Except we can easily add them to the default mask in the lsstsw repo. (Obviously, asking people to git pull promptly to pick up the update would be good.)

I think Tim is a little confused: the data depends on the obs_ package, not the other way around. This is not data to test the obs_ package; it is data to test the algorithms.

The problem here is the strict dependency model we use when building eups packages. In this case, the validation data depends only on the “back-end” interface generated by the obs_ package, so a new version of the data package should only be generated if it requires a new version of obs_ to read it (which typically means that the data package has changed as well).
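To make the distinction concrete, here is a toy sketch of the two models. It is not how lsstsw or eups actually compute versions; the function names, the hashing scheme, and the idea of a declared "interface revision" are all assumptions made up for illustration.

```python
import hashlib


def strict_version(name, source_rev, dep_versions):
    """Toy 'strict' model (hypothetical): the version string folds in every
    dependency's version, so any upstream change yields a new version."""
    h = hashlib.sha1()
    h.update(source_rev.encode())
    for dep in sorted(dep_versions):
        h.update(dep.encode())
    return f"{name}-{source_rev}+{h.hexdigest()[:8]}"


def interface_version(name, source_rev, interface_rev):
    """Toy 'relaxed' model (hypothetical): the version depends only on the
    package's own revision and a declared obs_ interface revision."""
    return f"{name}-{source_rev}+if{interface_rev}"


# Same data, same obs_cfht source; only afw changed underneath obs_cfht.
obs_old = strict_version("obs_cfht", "r1", ["afw-1"])
obs_new = strict_version("obs_cfht", "r1", ["afw-2"])

data_old = strict_version("validation_data_cfht", "abc123", [obs_old])
data_new = strict_version("validation_data_cfht", "abc123", [obs_new])
strict_changed = data_old != data_new  # a new data version is produced

relaxed_old = interface_version("validation_data_cfht", "abc123", 1)
relaxed_new = interface_version("validation_data_cfht", "abc123", 1)
relaxed_changed = relaxed_old != relaxed_new  # no new data version
```

Under the strict model the data package gets a new version even though neither the data nor the obs_ source changed; under the relaxed one it only changes when the data or the declared interface does.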

I think the ultimate solution to this and to other package mix-and-match issues may be to move away from strict dependencies, in at least limited cases.

I don’t think I was confused. I was suggesting flipping the dependency. I was throwing ideas out there to drive discussion ;-). I didn’t realize there was a code dependency between validation data and the obs package.

Allowing EUPS to easily set up a standalone data package alongside some other EUPS package would be great. Can you do that already?

@timj Thanks for the thoughts. I learned something.

There’s no code dependency, but there’s a potential dependency in terms of policies, registries, and other components of repository structure.

The strict dependency model is embedded in lsstsw. There’s no particular issue with eups features, just in how we use it.

So maybe we just say validation data doesn’t depend on obs_ packages at all? Build them separately and then learn how to get EUPS to set them both up.

I think when we update an obs_ package (or any lower-level pipeline code) we do generally want to rebuild test data packages - to the extent that rebuilding means re-running tests. We at least want to flag them as not-rebuilt, though our current system doesn’t distinguish between that and “should be rebuilt”.

To me it seems like the problem is that these packages are expected to be installed after being rebuilt. I think it might make sense to have a category of packages that are simply never installed. I think this would just work for lsstsw, where the version that would be reused would just be the source distribution in the build directory, and as @ktl noted this is already not a problem for eups distrib install (if you don’t mind running these tests manually) because we leave optional deps out of the manifest.

While there is a code (or policy file, registry) dependency, it’s barely worth tracking, because we can’t go back historically in an easy way.

Because we don’t pin to versions, I wouldn’t know how to, e.g., load the March 2016 version of the validation_data_cfht output repo, which needs to be read by the March 2016 version of obs_cfht.

Oh, yes. That’s a good point. The validation_data_* repos shouldn’t really be installed. It’s just a purely-redundant copy of the data.

But we do want to be able to set up the validation_data_* packages, so they do need to be “installed” in some sense; i.e., there needs to be an understanding that they need to be kept around.

If we published/distributed validation_data_cfht, a known-to-be working set of versions would be recorded in the manifest.

As repositories develop more structure managed by the Butler, some information will move into the repository itself, so no versioning is necessary. But there will always be some code for mappers and the Butler itself; it’s possible that the Butler should try to keep track of their versions and compatibility.

What if we had an entirely different system for installing data packages? They already have to use git lfs, which makes them different in that sense, and we’ve just established that we don’t want to install and version them the same way we do everything else. Something that more automatically does what we do manually for lsstsw (e.g. install from source, link into build dir, pull when needed) and puts them in a well-defined location seems like it would be very useful. When the butler is ready to handle repositories-of-repositories, it could also register the data packages with that, too.

It might even grow into something like what the DAX team would have had to build anyway for public users downloading LSST data repositories.

Effectively you are registering data with the butler using a standard directory structure. You don’t want to use EUPS at all.

@jbosch’s suggestion of lsstsw simply not installing a product might be fairly easy to implement. A flag could be added to repos.yaml to instruct lsst_build to clone a repo and run setup without ever building it.

To be a bit more pedantic about why this topic has come up, validation_data_hsc is almost 700GiB. That means an lsstsw build consumes twice that for the clone in the build dir + installed eups product. Any change in the dep chain (e.g., afw) results in another 700GiB version being installed (which takes hours to copy even on a fast filesystem).
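As a rough back-of-the-envelope, the growth looks like this. The sketch rounds the package up to 700 GiB and assumes each installed version is a full copy on top of one clone in the build dir; the function name is made up for illustration.

```python
PACKAGE_GIB = 700  # approximate size of validation_data_hsc quoted above


def disk_usage_gib(installed_versions, package_gib=PACKAGE_GIB):
    """One clone in the build dir plus one full copy per installed version."""
    return package_gib * (1 + installed_versions)


# Each dep-chain change that triggers a reinstall adds one more full copy.
usage = {n: disk_usage_gib(n) for n in (1, 2, 3)}
for n, gib in usage.items():
    print(f"{n} installed version(s): {gib} GiB")
```

So the first build is roughly 1400 GiB, and two subsequent dependency changes would take it to roughly 2800 GiB unless the old copies are removed.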

I filed DM-4637 based on an idea like this: a flag to cause certain large packages to just be symlinked.
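For what it’s worth, the symlink idea could be sketched roughly like this. This is only an illustration under my own assumptions, not what the ticket implements: `install_by_symlink` and the directory layout are hypothetical, and the real change would live somewhere in lsstsw/lsst_build.

```python
import os
import tempfile


def install_by_symlink(source_dir, install_path):
    """Hypothetical helper: make install_path a symlink to source_dir
    instead of copying the (very large) package contents."""
    os.makedirs(os.path.dirname(install_path), exist_ok=True)
    # Replace a stale link; a real directory at install_path is left alone
    # and os.symlink will raise FileExistsError in that case.
    if os.path.islink(install_path):
        os.remove(install_path)
    os.symlink(os.path.abspath(source_dir), install_path,
               target_is_directory=True)
    return install_path


# Small demo in a throwaway directory (paths are illustrative only).
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "build", "validation_data_hsc")
    os.makedirs(src)
    dest = install_by_symlink(
        src, os.path.join(root, "stack", "validation_data_hsc", "v1"))
    is_link = os.path.islink(dest)
    same_target = os.path.realpath(dest) == os.path.realpath(src)
```

The "installed" version then costs essentially no extra disk and no copy time, at the price that it is only as stable as the clone it points to.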