In thinking about the parallelization and data flow for Data Release Production, it’s occurred to me that it could be very difficult to support processing data on laptops using the full pipeline in the near future. That’s because the pipeline is going to need to process large volumes of data at least somewhat simultaneously - for instance, we’ll be computing full focal-plane solutions for the WCS, PSF, and background model. My guess is that processing a full LSST focal plane will take around 50-60 GB of memory, which isn’t a problem on a big cluster (or even a single node of most new clusters), but it is more than we’ll be able to expect user laptops to have for quite a while. Other stages in processing, such as joint calibration (if tracts are large), background matching, or multifit (on very deep datasets) could be even larger.
I think there are a few ways we can deal with this:
Don’t even try to support certain kinds of processing on small machines. This could limit our ability to develop on airplanes and to get complete test coverage on personal developer machines, but I think you can make a case that no one should really be trying to run high-level pipeline code on laptops anyway.
Provide alternative pipelines that do lower-quality or less-robust processing than the full pipeline (e.g. by running CCDs in serial and not doing any full-focal-plane fitting), but are otherwise still compatible with downstream dependent pipelines in terms of the outputs they produce. I worry about the added cost of developing two similar-but-not-identical pipelines, even if this is basically what we’d want to do to support SExtractor-style use of the pipeline.
Provide a way to run the full pipeline on laptops, but much more slowly (via a lot of swapping to disk). This might require running under very different middleware, and hence I’m worried about the cost of new middleware development we wouldn’t otherwise have to do.
I think I have a preference for (1), but to avoid the biggest downsides of that I think we’ll need to work hard to make it easy to exactly reproduce small pieces of the pipeline from intermediate outputs (something we’ve at least vaguely referred to as “freeze-drying” in the past). That would allow these small pieces to at least be run on developer machines from intermediates produced on big machines. I suspect that even then we’d mostly still prefer to do this sort of development on machines that share a filesystem with the big machines (rather than laptops), but those could be (e.g.) Nebula systems that we can spin up immediately rather than the queue-based big machines themselves.
I think we have to do (3). I would expect that most if not all algorithms can and will be written so that the minimum executable piece requires an amount of RAM that is commonly available on a laptop or cheap server (and further parallelizes over that RAM using multiple cores). Communication between such minimal pieces to perform a larger algorithm can be done optimally on larger-memory machines but can also be done sub-optimally but not too slowly on even a single machine using persistent storage.
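The communicate-through-persistent-storage idea above can be sketched concretely. This is only a toy under invented names (`process_piece`, `exchange_via_disk` are not real stack APIs): each minimal piece stays within a small RAM budget and emits a small summary, and the summaries are exchanged through files, which is the slow-but-workable single-machine path; a big-memory machine would just keep them in memory.

```python
# Toy sketch of the pattern described above: a minimal piece of an
# algorithm runs within a bounded RAM budget, and larger algorithms are
# composed by exchanging small summaries between pieces. On a big machine
# the exchange can live in memory; on a laptop it can go through disk.
# All names here are illustrative, not real stack APIs.
import os
import pickle
import tempfile


def process_piece(piece_id, data):
    """Minimal executable piece: bounded memory, returns a small summary."""
    return {"piece": piece_id, "mean": sum(data) / len(data)}


def exchange_via_disk(summaries, workdir):
    """Sub-optimal but workable on one machine: persist each summary."""
    paths = []
    for s in summaries:
        path = os.path.join(workdir, "piece-%d.pkl" % s["piece"])
        with open(path, "wb") as f:
            pickle.dump(s, f)
        paths.append(path)
    # A later central step re-reads only the small summaries, never the
    # full pixel data, so peak memory stays near the per-piece budget.
    gathered = []
    for path in paths:
        with open(path, "rb") as f:
            gathered.append(pickle.load(f))
    return gathered


with tempfile.TemporaryDirectory() as workdir:
    summaries = [process_piece(i, [float(i), float(i) + 2.0]) for i in range(4)]
    gathered = exchange_via_disk(summaries, workdir)
    central = sum(s["mean"] for s in gathered) / len(gathered)
```

The point of the sketch is just that the per-piece code is identical in both modes; only the exchange mechanism changes.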
@robkooper is thinking about these issues and will have a proposal for public review in something less than a month’s time.
I’m quite happy to see that Jim is putting critical thought into this. I love that he seems to be driving progress on these questions. And I’m not intending to be a naysayer but … what exactly is the requirement for inter-operating with laptops? There may very well be one; I do not have the documentation memorized. If there is no accepted requirement, I would agree with Jim’s preference for (1) at least initially. (2) and (3) seem out of scope.
(1) seems like the correct approach here: developers should be writing unit/integration tests that can be run on small data volumes, while the validation data tests should only be run on large machines (for storage, if no other reason).
The pipeline could be run on laptops, but only on small data volumes (e.g. one or a few CCDs), which is generally plenty for testing purposes.
This sounds like my (2), actually - the nature of the processing is actually different if you’re just running a small number of CCDs.
Interesting. How is the nature of the processing different? Wouldn’t you just apply the same system over a smaller sky region? (please feel free to point me to a document, if you’ve already written this up somewhere.)
I guess the bigger question then is whether we want to provide any kind of “SExtractor-like” processing utility. I think it would be a big loss to the community if there were no facility to use the stack on smaller quantities of data.
Once you’ve gotten to coadds, I think you’re right that different size regions on the sky aren’t qualitatively different (aside from perhaps the scale at which you can fit the background).
But a full visit is a very special unit of data. The clearest case is probably PSF modeling: a full focal-plane model could depend on the wavefront sensors as well as on having data at different points in the focal plane, because there are modes of the instrument that produce coherent patterns over the focal plane.
Or for a more immediate problem, if we had a full-visit matcher for solving astrometry, I’d expect that to be vastly more robust than something that operates only on individual CCDs, because we have so many more bright stars to work with. And that could allow us to make optimizations in the matcher that just wouldn’t work if we were restricted to fitting only a single CCD at a time.
As for SExtractor-like processing, I do think we need to support something along those lines, but I think we need to start decoupling that use case from the CmdLineTasks we use to drive survey-style processing of large, well-characterized datasets. I think we can provide both simple scripts for low-quality processing of completely generic images and a high-quality production pipeline for surveys, but I don’t think we can continue to try to have the same piece of code do both extremes as well as everything in between.
The question here is what is the maximum size of a task instance; any larger processing would have to be composed of instances of that size or smaller. I’m suggesting three things:
- If the maximum size is about the size of a laptop, then the exact same code and algorithms can be executed (slowly) in development mode as in production mode, and debugging can be guaranteed to be looking at the same things that production would be.
- The smaller the maximum size, the more flexibility we have to run on different configurations. (One extreme but interesting use case is when the maximum size is no larger than the memory available per core even on a many-core system.)
- Most if not all full-focal-plane computations can be decomposed, and not entirely unnaturally, into smaller pieces along with some communication and central computation. I’m a little worried that allowing the maximum size to be too big will encourage monolithic codes that will execute inefficiently on future processors and clusters.
At the same time, if setting the maximum size too small is constraining the science codes and making them too difficult to write, and if debugging is not too much of a problem, then we should choose a different balancing point.
Here’s the specific case I’m most worried about. The way I see it, this can be decomposed as in @ktl’s (3), but that doesn’t decrease the maximum size below the laptop threshold unless we do some extra I/O to make that the case, and doing that extra I/O imposes serious constraints on how we express the parallelization in algorithms.
In single frame processing, we’ll do approximately these steps, repeatedly:
- Process each sensor independently. This requires storing the full sensor MaskedImage and associated catalog in memory, including Footprints (~200 MB for each sensor, or ~37 GB total), as well as a bit more for workspace. In general, we should assume both the image and the catalog will be modified by this stage (as well as the image metadata).
- Transfer a small amount of information (~2-3 MB) from each sensor to a central process, and do some central computation (perhaps 1-2 GB of workspace and storage).
I expect we’d repeat this pattern anywhere from 5-50 times when processing a single visit, with most central-processing rounds requiring at most a few tens of seconds (and some may just be a few milliseconds).
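The repeated per-sensor/central pattern above can be sketched as a toy loop. All names and numbers here are invented for illustration (they are not stack APIs); the real per-sensor step would hold the full MaskedImage and catalog, while only the small summaries cross to the central step.

```python
# Toy sketch of the repeated scatter/gather pattern: each round processes
# sensors independently, sends a small summary to a central step, and the
# central result feeds the next round.
def process_sensor(sensor_id, central_state):
    # Stands in for per-sensor work on a full MaskedImage + catalog
    # (~200 MB in the real case); here it just returns a tiny summary
    # (~2-3 MB in the real case).
    return {"sensor": sensor_id, "stat": sensor_id + central_state}


def central_step(summaries):
    # Stands in for the full-visit computation (e.g. a focal-plane PSF
    # or WCS fit) that only needs the summaries, not the pixels.
    return sum(s["stat"] for s in summaries) / len(summaries)


n_sensors = 4        # 189 for a real LSST visit
n_rounds = 3         # 5-50 rounds per visit, per the estimate above
central_state = 0.0
for _ in range(n_rounds):
    summaries = [process_sensor(i, central_state) for i in range(n_sensors)]
    central_state = central_step(summaries)
```

On a big machine the per-sensor calls in each round run in parallel with the images held in memory; the laptop question below is about what happens to those images between rounds.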
On laptops, we could process only a few sensors at a time, and read/write their intermediates from/to disk every time we switch to a new batch of sensors. This is obviously slower because of the swapping, but it should do the exact same thing as the full pipeline. The big concerns for me are:
- I think this requires that we define butler datasets (or something similar) for a ton of algorithm intermediates; this seems rather fragile.
- We’d need to clutter up our algorithm code with I/O calls that most developers won’t think of as I/O.
- Instead of having the option of writing the top level of the per-sensor algorithm code as a single function with blocking calls to the central visit-level computation (which I think would be a lot more natural), we’d have to split it up into multiple functions so local variables can go out of scope.
- We’d need to add smarts to the middleware and configuration system to decide when those I/O calls should write to disk or write to memory.
In short, this sounds like it imposes something like pex_harness-style parallelization, in which we split up the algorithms into chunks that can only communicate via some broker. That’s the right decision for parallelization in the stack above some level, but I’d rather that didn’t extend down to parallelization of sensors within a visit.
Of course, writing the sensor-level algorithms as regular Python code that explicitly invokes parallelization also thwarts the sort of freeze-drying/resumability that I’ve also declared that I want (since we’re not telling the middleware enough about the intermediates), so I’m coming around to the idea that I can’t have it all. Unless the middleware people have something magical (in the Arthur C. Clarke sense) in mind that solves these problems in a way that I’m not seeing, maybe this is really a question for Science Pipelines people to decide what they’re willing to give up:
- Do we want algorithm code that reads and flows naturally, with parallelism expressed via language features like blocking function calls, pool.map, or OpenMP-style for loops, if that means we can only run it on big machines and have very low granularity for interrupt/resume - and hence have to develop and test on the big machines, too?
- Or do we want to split even mid-level algorithms into data-parallel chunks run by the middleware, if that means we can run it on laptops and have high granularity for interrupt/resume?
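To make the trade-off concrete, here is a hedged toy comparison of the two styles. Neither API is real; `detect` and the `stage_*` functions are invented stand-ins for pipeline code.

```python
# Toy comparison of the two ways of expressing the same visit-level
# algorithm. All function names are invented for this sketch.
from concurrent.futures import ThreadPoolExecutor


def detect(sensor):
    # Stands in for the per-sensor algorithm code.
    return sensor * sensor


# Style 1: natural, in-process parallelism. The code reads top to bottom
# and local state survives across the central (gather) step, but the
# whole visit must fit on one big machine and there is no fine-grained
# interrupt/resume point.
def run_visit_natural(sensors):
    with ThreadPoolExecutor(max_workers=2) as pool:
        per_sensor = list(pool.map(detect, sensors))
    visit_solution = sum(per_sensor)      # blocking central computation
    return [v + visit_solution for v in per_sensor]


# Style 2: middleware-driven chunks. Every stage is a separate function
# with explicit inputs and outputs, so middleware could checkpoint
# between stages and run them on small machines - at the cost of
# splitting the algorithm and losing local variables between stages.
def stage_per_sensor(sensor):
    return detect(sensor)


def stage_central(per_sensor):
    return sum(per_sensor)


def stage_apply(value, visit_solution):
    return value + visit_solution


sensors = [1, 2, 3]
natural = run_visit_natural(sensors)
per_sensor = [stage_per_sensor(s) for s in sensors]
visit_solution = stage_central(per_sensor)
chunked = [stage_apply(v, visit_solution) for v in per_sensor]
```

Both styles compute the same thing; the question is which one we force algorithm writers to live in.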
I’d be curious to get @rhl’s and @price’s opinions on this in particular.
Oh, personally I think that there will always be a scale divide between what’s possible on clusters and what’s possible on an airplane, and I have no problem with that. I think we need to be able to do enough in airplane mode that people can do general development there, but I think it’s entirely legitimate for that to be decoupled from the production pipeline. In the past it wasn’t possible for science developers to run anything except the full heavy-weight pipeline system, but we’ve recently moved to the other end of the spectrum, where the pipeline system just runs the same scripts used by the science developers. I think we need to be somewhere in the middle so that we can support both efficient development and an efficient production pipeline. I believe that’s an important design goal of the so-called SuperTask project started by @gpdf, and I hope it will mature quickly to the point where we can start playing with it. No doubt @robkooper also has these things in mind as he dreams up a production system, and I’m looking forward to hearing his recommendations, because it’s quite likely there are technologies out there that we, as science developers, are unaware of but that will make this problem easier.
Otherwise we’ll never get any work out of @RHL…
I think superTasks might be handy here.
@gpdf and I were wondering about that and are toying with the idea of using coroutines to allow tasks to be “paused” when a new parallelization step is needed and then resumed when that has happened (since the activator is the thing that knows how to do the parallelization). This is all up in the air at the moment, so nothing about this is written down (just a whiteboard photo) and no proof of concept has been written.
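Since nothing is written down yet, the following is only a speculative toy of how a generator-based version of that idea might look. Everything here is invented for the sketch (`visit_task`, `measure`, `run_with_serial_activator`, the `("map", ...)` request tuples); the point is just that the task body reads as one natural function while the activator decides how each yielded request is parallelized.

```python
# Speculative toy of the coroutine idea: the task yields a "scatter
# request" whenever it needs a parallel step; the activator runs the
# request however it likes and sends the results back in. All names
# are invented for this sketch.
def measure(sensor):
    # Stands in for a per-sensor measurement.
    return float(sensor)


def visit_task(sensors):
    # First parallel round: per-sensor measurements.
    per_sensor = yield ("map", measure, sensors)
    # Central computation happens inline, with locals intact.
    zero_point = sum(per_sensor) / len(per_sensor)
    # Second parallel round, using the central result.
    calibrated = yield ("map", lambda s: s - zero_point, per_sensor)
    return calibrated


def run_with_serial_activator(task):
    """An activator that happens to run requests serially; a cluster
    activator could farm the same requests out to many nodes."""
    result = None
    try:
        while True:
            kind, func, args = task.send(result)
            assert kind == "map"
            result = [func(a) for a in args]
    except StopIteration as stop:
        return stop.value


calibrated = run_with_serial_activator(visit_task([1, 2, 3]))
```

The attraction is that swapping the activator changes where and how the yielded rounds run without touching the task body at all.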
The ideal is for scientist-programmers to write algorithms in a natural way, including communication primitives where they make sense. I was hoping (and suggested to RobK) that we could provide something like a “run in parallel” primitive as well as a “yield” primitive that would allow gathers in the middle of otherwise-parallel code. This sounds like it is tracking with the ideas of GregoryPDF and TimJ.
Problems with writing hard-coded parallelism include a lack of scalability and elasticity. Building a data-parallel system with finely-parallel components is possible but adds complexity. I’m hoping that it’s not inevitable.
This is an interesting question. My default opinion for big number crunching is yes, since you don’t want to have to rerun a bunch of stuff to get back to where you were if something fails. But I suspect that our worst case scenario for reprocessing after fixing a failure is probably on the order of minutes, so it’s less of a problem than, say, a big simulation.