Obs package design

KSK · June 14, 2017, 5:11pm

The following is a first cut at how we might bring the obs packages under control. This is not meant to make the obs packages “right”, just to make them more homogeneous and easier to implement from scratch. I’m hoping for a lively conversation that lets us crystallize a new design we can implement in a focused hack week later in the cycle.

What do obs packages currently contain?

Calibration information (Not including calibration images)
- Linearity
- Defects
- Electronics (gain, read noise, overscan region, serial numbers, etc.)
- Camera geometry
- Crosstalk
Instrument specific data manipulation tools
- E.g. Native defect format --> DefectList
Instrument specific task configuration overrides
Instrument specific task subclasses
- IngestTask
- IngestCalibTask
- IsrTask
CameraMapper subclass, specifically map, std, bypass methods.
Dataset definitions

Issues

The std_*, map_* and bypass_* functions in the CameraMapper are documented in the CameraMapper class, but not in the subclasses. This leads to cargo culting of possibly incorrect usage.
Calibration primacy and reproducibility are not obvious. It is not always clear what should be used for calibrations or where the calibration data came from originally. There’s also the question of how to keep code and calibrations up to date with each other.
Conflation of calibration information with code configuration is a problem because they change on different time scales and because one is a function of the data acquisition and the other is closer to a runtime decision.
The Mapper is in limbo in the sense that it doesn’t belong concretely in either the DAX team or SciPi team sphere of responsibility.
Ad hoc treatment: e.g. each obs package is using a different mechanism to transform calibration information from native format to the format needed by the stack.
The bi-temporal problem – There is no way currently to specify any combination of calibration products and code to apply the products: i.e. “reduce data as if it was 1995” and “rereduce data taken in 1995 with the latest and greatest” are the two extremes.
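To make the two extremes concrete, here is a minimal sketch of a lookup along two time axes: the date of the observation, and the date a calibration version became available. Everything in it (`CalibRecord`, `select_calib`, the field names, the sample records) is hypothetical and only illustrates the idea; it is not existing stack code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CalibRecord:
    """One version of a calibration product (all names here are hypothetical)."""
    name: str          # e.g. "defects"
    version: str       # label for this version of the product
    valid_start: date  # first observation date the product applies to
    valid_end: date    # last observation date the product applies to
    recorded: date     # date this version was added to the collection


def select_calib(records: List[CalibRecord], name: str, obs_date: date,
                 as_of: Optional[date] = None) -> Optional[CalibRecord]:
    """Return the newest matching record known at `as_of` (None means "latest")."""
    candidates = [
        r for r in records
        if r.name == name
        and r.valid_start <= obs_date <= r.valid_end
        and (as_of is None or r.recorded <= as_of)
    ]
    # Among versions valid for the observation, prefer the most recently recorded.
    return max(candidates, key=lambda r: r.recorded, default=None)


records = [
    CalibRecord("defects", "v1", date(1995, 1, 1), date(1995, 12, 31), date(1995, 2, 1)),
    CalibRecord("defects", "v2", date(1995, 1, 1), date(1995, 12, 31), date(2017, 6, 1)),
]
obs = date(1995, 7, 4)

# "Reduce data as if it was 1995": only versions recorded by then are eligible.
as_in_1995 = select_calib(records, "defects", obs, as_of=date(1995, 12, 31))
# "Rereduce 1995 data with the latest and greatest": no cutoff on recorded date.
latest = select_calib(records, "defects", obs)
```

This only covers the calibration-product axis; which version of the code applies the products is a separate choice that a lookup like this does not capture.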

Proposal

Split current obs packages into two git repositories each

Calibration repository: This will be a git(-lfs) repository containing all calibration data. The repository will also contain code and tests to allow generation of the calibration repository at scons time.
Configuration repository: This will be a git repository of largely configuration information: e.g. dataset definitions, config overrides, Mapper subclasses. TBD is where the raw data ingest task overrides live. They could find a home in either repository.

Provide defined mechanisms for manipulating and ingesting calibration data.
Document clearly the non-calibration information. We should provide a cookbook for how to generate an obs package. This means clearly documenting which pieces are commonly (or necessarily) overridden.

Calibration part

all calibration-like data in native format goes into a git repository specifically for holding these data.
the calibration repository is built at scons time from the data in native format to solve the primacy issue
discoverability is handled by valid date ranges in the calibration repository
the calibration repository will be append only: i.e. all versions of the calibration products will exist in the repo.
The bi-temporal problem is naturally addressed by this design. At any time, a calibration repository of the entire history of the calibration products can be generated from the native formats. Git tags will need to be used to keep track of changes in how the calibrations are applied by e.g. ip_isr.
obs_base will provide an ABC Task that will have the methods necessary for building the calibration repository. This may require coming up with a way to map calibs to valid ranges.

class BuildCalibRepoTask(object):
    def run(self):
        self.make_defects.run()
        self.ingest_defects.run()
        self.make_linearity.run()
        self.ingest_linearity.run()
        ...

Note: We could add the image-like calibration data via multiple parents.
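As a rough illustration of what an abstract base for this could look like, here is a sketch in plain Python using the standard `abc` module. It assumes that “make” means converting native-format data and “ingest” means registering the result in the calibration repository; that reading, and every name below (`BuildCalibRepoBase`, `calib_types`, `ExampleCamBuildCalibRepo`), is an assumption for illustration, not the actual Task API.

```python
from abc import ABC, abstractmethod
from typing import List


class BuildCalibRepoBase(ABC):
    """Hypothetical base class; a real version would presumably be a Task in obs_base."""

    # Calibration kinds handled, in build order (illustrative names only).
    calib_types = ("defects", "linearity")

    @abstractmethod
    def make(self, calib_type: str) -> str:
        """Convert native-format data for one calibration type; return a product handle."""

    @abstractmethod
    def ingest(self, calib_type: str, product: str) -> None:
        """Register a converted product in the calibration repository."""

    def run(self) -> None:
        # Make-then-ingest for each calibration type, expressed as a loop.
        for calib_type in self.calib_types:
            product = self.make(calib_type)
            self.ingest(calib_type, product)


class ExampleCamBuildCalibRepo(BuildCalibRepoBase):
    """Toy camera-specific subclass that only records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def make(self, calib_type: str) -> str:
        self.log.append(f"make {calib_type}")
        return f"{calib_type}.converted"

    def ingest(self, calib_type: str, product: str) -> None:
        self.log.append(f"ingest {product}")


builder = ExampleCamBuildCalibRepo()
builder.run()
```

With abstract methods, a camera package that forgets to implement `make` or `ingest` fails at instantiation rather than partway through a build.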

Non-calibration part

This is mostly documentation.

Document what the “magic” methods do and how to use them.
Move as many dataset definitions to obs_base and purge those not needed
Document the process of subclassing the ingest tasks
Identify common config overrides. Document required config overrides.
Document required VisitInfo attributes. This will involve a bit of policy making. I.e. what to do when a needed piece of VisitInfo is missing for a particular algorithm. This policy should be enforced in code where possible.

Links

https://jira.lsstcorp.org/browse/RFC-341

https://jira.lsstcorp.org/browse/DM-4624

I’m sure there are more…

jbosch · June 15, 2017, 7:33am

All of this looks at least reasonable to me. I’ve got a few scattered comments:

I think it might be a worthwhile exercise to try to enumerate the different kinds of calibration data and identify where the raw data and code to build the processed versions will live. I suspect we’ll find some things where it isn’t clear whether they should go in a git(-lfs) repo, and I’d like to have some idea how versioning of the things that don’t go in that git repo will relate to the versioning of things that do (and how versioning of raw calibration products relates to versioning of processed calibration products).
Some of the Butler/Registry ideas I’ve been playing with to solve some SuperTask problems may also give us an opportunity to clean up some aspects of how we specialize the pipeline for different cameras. This is not really fleshed out at all yet, but I think it could really simplify how mappers are defined. One big part of that would be having “ingest” steps for domains other than raw data - I’d like to see us ingest cameras into repositories before ingesting raw data or calibrations from those cameras, and I’d like to make SkyMap definition more of an ingest step as well. While that’s very closely related to what you’ve proposed here, I don’t think it really conflicts at this high level, and I don’t want waiting for that to get in the way of you moving forward with this proposal. There’s a very preliminary sketch of the larger system here, with the huge caveat that it only covers SkyMap-based data IDs right now, and it’d be easy to get the wrong impression about how I want to handle camera specialization and camera-based data IDs from what’s there.

RHL · June 15, 2017, 2:47pm

I agree that we need to split the obs_XXX packages into two parts, but I don’t think that this is quite the split that I’d choose. We need to move the data out of the obs_ package, but probably not all into git as it needs to be versioned like any other calibration data – I don’t think that there’s a bi-temporal problem here (except when you evolve your recommended calibration products, but we can handle that via a different calibration root).

We need to specify formats for camera descriptions. We currently use pex_config and I think that this is a bit of an abuse. I believe that we should use a yaml-based camera description (I have prototyped this for ComCam and CTIO0m9). I totally agree that we need to agree upon a format for defects; I think that the ascii one I used for HSC (based on what I used for SDSS) is probably good enough, but let’s take a careful look. This doesn’t have to be the format used by the pipelines – as Simon implied, HSC compiles the defects to a binary table on build. I don’t have a proposal for linearity, but I think that almost always it should be ascii. I don’t understand the make/ingest distinction in Simon’s notes.

I’d also put all the code in what Simon calls a “configuration repository” – but I’d just call it the obs package as now. I’d count the mapper overrides (the current .paf files – or now I think that they’re .yaml) as “code”.

I absolutely agree that we need cleanup/docs. Simon mentions VisitInfo, and I think it needs a bit of work (e.g. remove the need to add boilerplate copied from the docstrings to use the base class). Some of this cleanup will be clearer when we take one of the clean cameras such as ComCam where I actively fixed some of the bugs that led to the workarounds. We will doubtless find more changes in the butler/Mapper (and, unlike Simon, I think it’s clear that they live together – although I agree that it isn’t DAX’s job to support individual cameras).

I’m also not sure that each package needs its own ingestion scripts. We have never standardised the contents of the sqlite databases but I think we need to do so (presumably with code in obs_base) with overrides in the obs_ package. These sorts of data (extra columns in the registry…) should go with the code.
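A minimal sketch of how a standardised registry with per-camera extras might look, using only the standard `sqlite3` module. The column names, the `raw` table name, and `create_registry` are all made up for illustration; no schema has been agreed.

```python
import sqlite3
from typing import Dict

# Hypothetical common columns that obs_base might standardise.
BASE_COLUMNS: Dict[str, str] = {
    "visit": "INTEGER",
    "ccd": "INTEGER",
    "filter": "TEXT",
    "dateObs": "TEXT",
}


def create_registry(conn: sqlite3.Connection, extra_columns: Dict[str, str]) -> None:
    """Create a `raw` table from the base columns plus per-camera extras."""
    clash = set(BASE_COLUMNS) & set(extra_columns)
    if clash:
        raise ValueError(f"extra columns redefine base columns: {sorted(clash)}")
    columns = {**BASE_COLUMNS, **extra_columns}
    # Quote identifiers so names that are also SQL keywords stay valid.
    column_sql = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    conn.execute(f"CREATE TABLE raw ({column_sql})")


# A camera-specific package would supply only its additions (made-up example).
conn = sqlite3.connect(":memory:")
create_registry(conn, {"pointing": "INTEGER"})
column_names = [row[1] for row in conn.execute("PRAGMA table_info(raw)")]
```

Here the shared code owns the common columns and rejects overrides that would silently redefine them, while the obs_ package contributes only its extra columns.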