How to subsection a butler data repository

@parejkoj, @ctslater and I were discussing subsectioning repositories today. By that I mean creating, from a full repository, another repository that contains only the data associated with some of the datasets in the original.

An example is that I have a large repo that contains calexps, coadds, and all the source files. It would be nice to be able to create another repository with just the coadds so I can copy it over to another location for further analysis.

A better example is that we ran 1200 visits for the Twinkles project. To do anything useful with the catalogs, you also need the calexps, because you need the Calib and Wcs. Since all of this was done on remote clusters at SLAC and NERSC, you have to do any analysis there.

It seems like this is just a tooling issue and that the tool would be fairly simple. I could imagine something like this:
$> selectDatasets.py input_repo --id --datasets 'deepCoadd' 'calexp' --output output_repo

In terms of implementation, I don't think this is much more than a butler.get() on the input repo followed by a butler.put() on the output repo. The place where this gets tricky is source catalogs, because you want other information to go with them, e.g. Calib objects and calexp_md. The butler doesn't have the dependency information, so we might need a bit of additional information, 'recipes', for subsectioning some dataset types, but I don't think there are many of those. My only worry is that I don't know how the butler handles butler.put('calexp_md').
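A minimal sketch of what that get/put loop could look like, assuming the Gen2 lsst.daf.persistence.Butler API; the function name, dataset types, and dataId below are illustrative only, not an existing tool:

# Hedged sketch: copy selected dataset types/dataIds from one repo to another.
# Constructor arguments vary between Butler versions, and coadd dataset types
# would need tract/patch dataIds rather than the visit/ccd one shown here.
from lsst.daf.persistence import Butler

def copy_datasets(input_repo, output_repo, dataset_types, data_ids):
    butler = Butler(inputs=input_repo, outputs=output_repo)
    for dataset_type in dataset_types:
        for data_id in data_ids:
            if not butler.datasetExists(dataset_type, dataId=data_id):
                continue  # nothing to copy for this combination
            obj = butler.get(dataset_type, dataId=data_id)
            butler.put(obj, dataset_type, dataId=data_id)

copy_datasets("input_repo", "output_repo",
              dataset_types=["calexp", "src"],
              data_ids=[{"visit": 903334, "ccd": 16}])

Note that constructing the Butler this way makes the input repo a parent of the output repo, which is exactly the catch raised later in the thread.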

Possible near-term solutions

  • For catalogs we could prioritize the ingestion scripts. That would probably suffice, especially if we could easily put things in a SQLite database that we could hand out (see the sketch after this list).
  • Figure out how to persist items on their own that are currently persisted with very heavy things like exposures.
  • Make butler repositories remotely accessible.
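As a rough illustration of the first bullet, here is a hedged sketch of dumping a butler "src" catalog into a standalone SQLite file; the file name, table name, and dataId are invented, and array-valued columns may need extra handling:

# Sketch only: flatten one src catalog and append it to a SQLite table.
import sqlite3
from lsst.daf.persistence import Butler

butler = Butler("input_repo")
data_id = {"visit": 903334, "ccd": 16}  # illustrative dataId

src = butler.get("src", dataId=data_id)
df = src.asAstropy().to_pandas()  # afw table -> astropy Table -> DataFrame

with sqlite3.connect("sources.sqlite3") as conn:
    # one row per source; flag (boolean) columns are stored as 0/1
    df.to_sql("Source", conn, if_exists="append", index=False)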

I think the only option that solves all the issues mentioned in the thread is the second one. I agree that the third option makes things much easier, but it still means we need to have an internet connection and we need to deal with authentication if we want to give the data away.

Since I was the cause of this, I’ll just chime in that I feel like we’ll want to be able to pass around catalogs, without their associated images (my particular use case).


If we were dealing with an original-flavor git repo, subtree or sparse checkout would be an option. I haven't tried subtree, but sparse checkout seems to choke with LFS. E.g.:

mkdir calib
cd calib
git init
git remote add origin https://github.com/lsst/testdata_subaru.git
git config core.sparseCheckout true
echo "hsc/calib/" >> .git/info/sparse-checkout
git pull --depth=1 origin master

@josh I think a fundamental problem is that our persistence format does not allow for some of the data to be missing. I.e., one can't persist calexp metadata without persisting everything else associated with a calexp. In order for something like this to work, we have to be able to persist only part of the associated graph of datasets.

The nascent ideas about how to deal with composite datasets might allow something like this. I’m a little more worried about how to extract any necessary parts of parent repositories. Your idea of reading from one repo and writing to a new one doesn’t necessarily work because the Butler is currently assuming that the input repo (which automatically becomes a parent to the output repo) is always available. So this will take some work.

@KSK

I believe you are referring to a Butler data repository, not a GitHub repository?

Could you clarify the original title and question?

Perhaps the use case could be satisfied by the ability to access a remote data repo?

There have certainly been times when I’ve wanted to subsection a raw data repository, mostly to put a smaller test dataset on my laptop for testing while I traveled. The need for me to do that has largely gone away with official test datasets like ci_hsc.

Similarly, I suspect the need for subsectioning output data products could also be mitigated (but not eliminated) by making remote data analysis easier.

I tried to clarify a little more and lay out the options mentioned in the thread.


Just a note on this: right now we do have scripts to ingest a source catalog into a database, but we don't have well-developed functionality to build the Exposure table to go along with it. Ingestion itself is not a solution until we have a defined persistence format for that metadata outside of a calexp, be it a table schema or a file format. I don't know if anyone has written down what the Exposure table should look like.
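For concreteness, here is a guessed-at minimal per-CCD Exposure table; the column names, types, and units are purely illustrative, not the baseline schema, and deciding the real column list is exactly the open question here:

# Hypothetical per-CCD Exposure table, for illustration only.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS Exposure (
    visit      INTEGER NOT NULL,
    ccd        INTEGER NOT NULL,
    filter     TEXT,
    mjd_obs    REAL,   -- start of exposure (MJD)
    exptime    REAL,   -- exposure time (s)
    zeropoint  REAL,   -- photometric zero point (mag)
    ra_center  REAL,   -- CCD-centre RA (deg)
    dec_center REAL,   -- CCD-centre Dec (deg)
    PRIMARY KEY (visit, ccd)
);
"""

with sqlite3.connect("sources.sqlite3") as conn:
    conn.executescript(schema)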

We have a nominal schema for various flavors of Exposure table (at the visit, CCD, or even amplifier level), but many of the columns have been awaiting Science Pipelines input for what needs/is desired to be stored.

How is that need for input captured, and what is the plan for getting that input? It’s certainly not part of the DPDD discussion.

I’m not comfortable making a testing suite dependent on a network connection or remote database being live. That seems like a recipe for tests failing for strange reasons. Similarly, it seems silly to me that I have to include the images (a few GB) when I’m making a test data set containing a source catalog (a few MB total).

I completely agree. I'm hoping we'd eventually get to a steady state where we already have predefined test datasets that are sufficient for any new tests we write (this is what I want afwdata to become). We're clearly not there yet, but it's not clear to me that the number of test datasets will ever be large enough to require a lot of tooling work. It will be, though, if we don't make a coordinated effort to define test datasets that are useful for multiple kinds of tests.

That said, Level 3 users are obviously going to want to download subsets of productions, so we’ll definitely need that tooling eventually.

I think the question being posed here is a bit more general than Butler subsectioning (though @ctslater's and @ktl's responses about Exposure tables come close to the real answer).

The problem that @parejkoj faced is (based on out-of-band discussions) that jointcal needs the properties of the sources to do its work, but because it also needs to know about the visits they were measured on, it also needs some metadata about those visits. Currently the only way to get this is to read the "src" and "calexp_md" datasets and manually create the needed metadata objects (e.g. Wcs and Filter) from the metadata. The butler will soon (?) be able to return a "calexp_exposureInfo" (sp?) instead of the _md, and that'll be much better, but as currently planned you still need the full calexp to get the calexp_md/calexp_exposureInfo.
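For reference, a hedged sketch of that workaround; the factory calls follow the older afw API of the time and may differ in name or location across stack versions, and the dataId is illustrative:

# Read only "src" and "calexp_md" and rebuild the per-visit metadata by hand.
import lsst.afw.image as afwImage
from lsst.daf.persistence import Butler

butler = Butler("input_repo")
data_id = {"visit": 903334, "ccd": 16}

src = butler.get("src", dataId=data_id)       # the source catalog itself
md = butler.get("calexp_md", dataId=data_id)  # FITS header as a PropertyList

wcs = afwImage.makeWcs(md)    # astrometric solution from the header
calib = afwImage.Calib(md)    # photometric zero point
filt = afwImage.Filter(md)    # filter name/ID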

So the real problem is that the src tables are not self-contained. We could "fix" this for the LSST pipelines by creating composite data products (e.g. split the Exposure into MaskedImage and ExposureInfo on write and reconstitute it on read), and this would be an improvement; in fact, I think it'd solve John's immediate problem. I wrote "fix" because we'll need to solve the larger problem at some point, namely how we export data to the community (and sorry, this almost certainly means FITS). We might say that this can be postponed until ComCam data starts flowing, but I don't think we can afford to wait that long. One option would be to denormalise the information: write the ExposureInfo into the Source tables and provide stand-alone code to read the pieces that are complex enough to need it, e.g. the astrometric transforms and the Psf (if it's included; I don't think we'll need it). We could obviously choose to write an approximate ExposureInfo (e.g. a FITS WCS and a PSF image at the centre of the image) to the source tables, and that might be OK too.

I see three use cases here:

  1. John’s situation: an LSST dev wants to work on catalog-level processing without downloading lots of images.
  2. Giving LSST catalogs to a non-LSST-savvy person. This case is currently broken, because they need the Calib object from the images and other image metadata.
  3. Giving LSST images to a non-LSST-savvy person. This case is currently OK, but could be broken by solutions to #1 or #2.

Robert's solution (splitting out a separate ExposureInfo file) handles case #1, and I think if we wrote it out in a reasonably intelligible file format then it could also handle case #2. That is, if the ExposureInfo persistence were a YAML file, then I think the non-LSST-savvy user wouldn't grumble too much about having to look in there for the zero point, filter, etc. (though more complex entities may still be opaque). If ExposureInfo were a completely opaque blob that required the stack even to read it, then that might not satisfy #2.
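To make that concrete, here is an invented example of what a human-readable ExposureInfo-as-YAML could look like; every key, value, and unit is made up for illustration, and nothing here is a defined LSST persistence format:

# Print a toy ExposureInfo as YAML so a non-stack user could read it directly.
import yaml

exposure_info = {
    "visit": 903334,
    "ccd": 16,
    "filter": "HSC-I",
    "calib": {"fluxMag0": 6.0e12, "fluxMag0Err": 3.0e9},
    "wcs": {  # approximate FITS-style TAN WCS
        "CTYPE1": "RA---TAN", "CTYPE2": "DEC--TAN",
        "CRVAL1": 320.25, "CRVAL2": 0.5,
        "CRPIX1": 1024.5, "CRPIX2": 2088.5,
        "CD1_1": -4.7e-5, "CD1_2": 0.0,
        "CD2_1": 0.0, "CD2_2": 4.7e-5,
    },
}

print(yaml.safe_dump(exposure_info, default_flow_style=False))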

Where I think some denormalization is inevitable is that the calexps will still need to have WCS and some basic metadata, otherwise we would break case #3 and our own usage of things like ds9. So as long as that issue can be navigated, I think the overall state would be improved with an ExposureInfo file.

Actually, I meant to come down on the side of denormalising and making our source tables useful to non-stack-users. Sorry to have obfuscated that opinion.