@parejkoj, @ctslater and I were discussing subsectioning repositories today. What I mean is creating from a full repository another repository that only has the data associated with some of the datasets in the original repository.
An example is that I have a large repo that contains calexps, coadds, and all the source files. It would be nice to be able to create another repository with just the coadds so I can copy it over to another location for further analysis.
A better example is that we ran 1200 visits for the Twinkles project. To do anything with the catalogs, you also need the calexps because you need the Calib and Wcs to do anything useful. Since all this was done on remote clusters, SLAC and NERSC, you have to do any analysis there.
It seems like this is just a tooling issue, and the tool would be fairly simple. I could imagine something like:

$> selectDatasets.py input_repo --id --datasets 'deepCoadd' 'calexp' --output output_repo
In terms of implementation, I don't think this is much more than a butler.get on the input repo followed by a butler.put on the output repo. The place where this gets tricky is for source catalogs, because you want other info to go with the catalogs: e.g. Calib objects and calexp_md. The butler doesn't have the dependency information, so we might need a bit of additional information, "recipes", for subsectioning some dataset types, but I don't think there are that many. My only worry is that I don't know how the butler handles butler.put('calexp_md').
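To make the get/put idea concrete, here is a minimal sketch of the copy loop. The real tool would use lsst.daf.persistence.Butler; the dict-backed FakeButler below is a stand-in I made up purely to show the shape of the logic (dataset type + dataId -> object), and the data-ref iteration method is an assumption, not the real Butler API.

```python
class FakeButler:
    """Stand-in for the real Butler: a mapping of (datasetType, dataId) -> object."""

    def __init__(self, store=None):
        self.store = dict(store or {})

    def get(self, dataset_type, data_id):
        return self.store[(dataset_type, tuple(sorted(data_id.items())))]

    def put(self, obj, dataset_type, data_id):
        self.store[(dataset_type, tuple(sorted(data_id.items())))] = obj

    def data_refs(self, dataset_type):
        """Yield the dataIds present for a dataset type (cf. butler.subset)."""
        for (dtype, key) in self.store:
            if dtype == dataset_type:
                yield dict(key)


def select_datasets(in_butler, out_butler, dataset_types):
    """Copy every instance of the named dataset types into the output repo."""
    for dtype in dataset_types:
        for data_id in in_butler.data_refs(dtype):
            out_butler.put(in_butler.get(dtype, data_id), dtype, data_id)


input_repo = FakeButler({
    ("deepCoadd", (("patch", "1,1"), ("tract", 0))): "coadd-pixels",
    ("calexp", (("ccd", 42), ("visit", 1000))): "calexp-pixels",
    ("src", (("ccd", 42), ("visit", 1000))): "source-catalog",
})
output_repo = FakeButler()
select_datasets(input_repo, output_repo, ["deepCoadd", "calexp"])
# output_repo now holds the coadd and calexp but not the src catalog
```

The tricky part the thread raises (dependencies like Calib and calexp_md) would sit in the dataset_types list: a "recipe" would just expand 'src' into the set of types that must travel with it.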
Possible near-term solutions:
1. For catalogs, we could prioritize the ingestion scripts. That would probably suffice, especially if we could easily put things in a sqlite database that we could hand out.
2. Figure out how to persist items on their own that are currently persisted with very heavy things like exposures.
3. Make butler repositories remotely accessible.
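The ingestion option (a sqlite file we can hand out) could be very lightweight. A sketch with the standard library's sqlite3 module follows; the table name and columns (id, ra, dec, psf_flux) are illustrative placeholders, not the real LSST source schema.

```python
import sqlite3

# Hypothetical source rows pulled from a catalog: (id, ra, dec, psf_flux).
rows = [
    (1, 150.001, 2.201, 1234.5),
    (2, 150.007, 2.199, 987.1),
]

conn = sqlite3.connect(":memory:")  # a real handout would use a file path
conn.execute(
    "CREATE TABLE source (id INTEGER PRIMARY KEY, ra REAL, dec REAL, psf_flux REAL)"
)
conn.executemany("INSERT INTO source VALUES (?, ?, ?, ?)", rows)
conn.commit()

n, = conn.execute("SELECT COUNT(*) FROM source").fetchone()
print(n)  # 2
```

The hard part, as noted below, is not this table but the accompanying Exposure table that carries the per-visit metadata.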
I think the only option that solves all the issues mentioned in the thread is the second one. I agree that the third option makes things much easier, but it still means we need to have an internet connection and we need to deal with authentication if we want to give the data away.
Since I was the cause of this, I'll just chime in that I feel like we'll want to be able to pass around catalogs without their associated images (my particular use case).
If we were dealing with an original-flavor git repo, subtree or sparse checkout would be an option. I haven't tried subtree, but sparse checkout seems to choke with LFS, e.g.
@josh I think a fundamental problem is that our persistence format does not allow for some of the data to be missing, i.e. one can't persist calexp metadata without persisting everything associated with a calexp. In order for something like this to work, we have to be able to persist only part of the associated graph of datasets.
The nascent ideas about how to deal with composite datasets might allow something like this. I'm a little more worried about how to extract any necessary parts of parent repositories. Your idea of reading from one repo and writing to a new one doesn't necessarily work, because the Butler currently assumes that the input repo (which automatically becomes a parent of the output repo) is always available. So this will take some work.
There have certainly been times when I've wanted to subsection a raw data repository, mostly to put a smaller test dataset on my laptop for testing while I traveled. The need for me to do that has largely gone away with official test datasets like ci_hsc.
Similarly, I suspect the need to subsection output data products could also be mitigated (but not eliminated) by making remote data analysis easier.
Just a note on this: right now we do have scripts to ingest a source catalog into a database, but we don't have well-developed functionality to build the Exposures table to go along with it. Ingestion itself is not a solution until we have a defined persistence format for that metadata outside of a calexp, be it a table schema or a file format. I don't know if anyone has written down what the Exposure table should look like?
We have a nominal schema for various flavors of Exposure table (at the visit, CCD, or even amplifier level), but many of the columns have been awaiting Science Pipelines input for what needs/is desired to be stored.
I'm not comfortable making a testing suite dependent on a network connection or remote database being live. That seems like a recipe for tests failing for strange reasons. Similarly, it seems silly to me that I have to include the images (a few GB) when I'm making a test data set containing a source catalog (a few MB total).
I completely agree. I'm hoping we'd eventually get to a steady state where we already have predefined test datasets that are sufficient for any new tests we write (this is what I want afwdata to become). While we're clearly not there yet, I don't think the number of test datasets needs to be large enough to require a lot of tooling work, but it will be if we don't have a coordinated effort to define test datasets that are useful for multiple kinds of tests.
That said, Level 3 users are obviously going to want to download subsets of productions, so we'll definitely need that tooling eventually.
I think that the question being posed here is a bit more general than butler subsectioning (but @ctslater and @ktl's responses about Exposure tables come close to the real answer).
The problem that @parejkoj faced is (based on out-of-band discussions) that jointcal needs the properties of the sources to do its work, but because it also needs to know about the visits on which they were measured, it needs some metadata about them as well. Currently the only way to get this is to read the 'src' and 'calexp_md' datasets and manually create the needed metadata objects (e.g. Wcs and Filter) from the metadata. The butler will soon (?) be able to return a 'calexp_exposureInfo' (sp?) instead of the _md, and that'll be much better, but as currently planned you still need the full calexp to get the calexp_md/calexp_exposureInfo.
So the real problem is that the src tables are not self-contained. We could "fix" this for the LSST pipelines by creating composite data products (e.g. split the Exposure into MaskedImage and ExposureInfo on write and reconstitute on read), and this would be an improvement; in fact, I think it'd solve John's immediate problem.

I wrote "fix" because we'll need to solve the larger problem at some point, namely how we export data (and sorry, this almost certainly means FITS) to the community. We might say that this can be postponed until ComCam data starts flowing, but I don't think we can afford to wait that long. One option would be to denormalise the information: write the ExposureInfo to the Source tables, and provide stand-alone code to read the parts that are complex, e.g. the astrometric transforms and Psf (if included; I don't think we'll need it). We could obviously choose to write an approximate ExposureInfo (e.g. a FITS WCS and a PSF image at the centre of the image) to the source tables, and that might be OK too.
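The approximate-ExposureInfo variant of the denormalisation idea could amount to copying a handful of FITS-style keywords into the catalog's own header. A sketch with plain dicts follows; the exact keyword set (standard FITS WCS keywords plus MAGZERO and FILTER) and the helper name are my assumptions for illustration, not an agreed schema.

```python
# Sketch: denormalise an approximate ExposureInfo into a source table's own
# header, so the catalog can be interpreted without the calexp. The keyword
# set (CRVAL/CRPIX/CD matrix + zero point + filter) is illustrative.

def approximate_exposure_info(calexp_md):
    """Pull the minimal WCS + photometric keywords out of calexp metadata."""
    keep = ["CRVAL1", "CRVAL2", "CRPIX1", "CRPIX2",
            "CD1_1", "CD1_2", "CD2_1", "CD2_2",
            "MAGZERO", "FILTER"]
    return {k: calexp_md[k] for k in keep if k in calexp_md}

# Hypothetical calexp metadata (values made up).
calexp_md = {
    "CRVAL1": 150.0, "CRVAL2": 2.2, "CRPIX1": 1024.5, "CRPIX2": 1024.5,
    "CD1_1": -5.6e-5, "CD1_2": 0.0, "CD2_1": 0.0, "CD2_2": 5.6e-5,
    "MAGZERO": 27.0, "FILTER": "HSC-I",
    "EXPTIME": 30.0,  # not copied: not needed to interpret the catalog
}

src_catalog = {"header": {}, "rows": [...]}
src_catalog["header"].update(approximate_exposure_info(calexp_md))
# The catalog header now carries enough WCS and zero-point information for a
# non-stack user; complex entities (full transforms, Psf) would still need
# stand-alone reader code or an intelligible format like YAML.
```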
1. John's situation: an LSST dev wants to work on catalog-level processing without downloading lots of images.
2. Giving LSST catalogs to a non-LSST-savvy person. This case is currently broken, because they need the Calib object from the images and other image metadata.
3. Giving LSST images to a non-LSST-savvy person. This case is currently OK, but could be broken by solutions to #1 or #2.
Robert's solution (splitting out a separate ExposureInfo file) handles case #1, and I think if we wrote it out in a reasonably intelligible file format then it could also handle case #2. That is, if the ExposureInfo persistence were a YAML file, then I think the non-LSST-savvy user wouldn't grumble too much about having to look in there for the zero point and filter, etc. (though more complex entities may still be opaque). If ExposureInfo were a completely opaque blob that required the stack to read at all, then that might not satisfy #2.
Where I think some denormalization is inevitable is that the calexps will still need to have WCS and some basic metadata, otherwise we would break case #3 and our own usage of things like ds9. So as long as that issue can be navigated, I think the overall state would be improved with an ExposureInfo file.
Actually, I meant to come down on the side of denormalising and making our source tables useful to non-stack-users. Sorry to have obfuscated that opinion.