The HSC pipeline team recently completed a production run on HSC Strategic Survey Program data acquired over the last two years. The Princeton group ran the production on the “wide” survey layer, consisting of ~ 5000 exposures covering ~ 100 deg^2 in grizy to r ~ 26 mag. We used a Princeton Slurm cluster with a GPFS filesystem; we expected to use about 2000 core-weeks, and that’s roughly what we used once mistakes are factored out. The pipeline execution used what’s now in the LSST packages ctrl_pool and pipe_drivers to perform the following operations:
- reduceFrames: run processCcd on each CCD, gather the match lists, and fit an exposure-wide astrometric solution, which is then put into each CCD before writing. We ran several instances of this on 112 cores each (each exposure has 112 CCDs); a sketch of this scatter/gather pattern follows the list.
- mosaic: run meas_mosaic on each tract,filter to get a consistent astrometric and photometric solution. meas_mosaic uses threads in the matrix solvers (via MKL), so we ran one per node.
- stack: run makeCoaddTempExp, assembleCoadd and detectCoaddSources on each tract,filter. We ran several instances of this on 81 cores each (tracts have 9x9 patches).
- multiband: run mergeCoaddDetections, measureCoaddSources, mergeCoaddMeasurements, forcedPhotCoadd on each tract. We ran several instances of this on 81 cores each.
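For concreteness, here is a minimal sketch of the scatter/gather pattern that reduceFrames follows. It uses Python’s concurrent.futures purely for illustration, not ctrl_pool’s actual API, and process_ccd, fit_exposure_astrometry, and write_ccd are hypothetical stand-ins for the real pipeline calls.

```python
# Minimal sketch of the reduceFrames scatter/gather pattern, using
# concurrent.futures for illustration only (not ctrl_pool's real API).
from concurrent.futures import ProcessPoolExecutor

NUM_CCDS = 112  # CCDs per HSC exposure


def process_ccd(ccd_id):
    # Hypothetical stand-in: run single-CCD processing and return
    # the CCD's results, including its astrometric match list.
    return {"ccd": ccd_id, "matches": []}


def fit_exposure_astrometry(match_lists):
    # Hypothetical stand-in: fit one astrometric solution to the
    # match lists from all CCDs in the exposure.
    return "exposure-wide-wcs"


def write_ccd(result, wcs):
    # Hypothetical stand-in: attach the exposure-wide solution and
    # persist the CCD's outputs.
    pass


def reduce_frame(ccd_ids):
    with ProcessPoolExecutor(max_workers=NUM_CCDS) as pool:
        # Scatter: one task per CCD.
        results = list(pool.map(process_ccd, ccd_ids))
        # Gather: fit a single solution across the whole exposure.
        wcs = fit_exposure_astrometry([r["matches"] for r in results])
        # Scatter again: write each CCD with the shared solution.
        list(pool.map(write_ccd, results, [wcs] * len(results)))


if __name__ == "__main__":
    reduce_frame(range(NUM_CCDS))
```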
Problems we noticed in the processing include:
- Keeping track of which processing has run successfully and which has failed is difficult, because the notification of success or failure is buried in the Slurm logs.
- The process pool (ctrl_pool, or at least our use of it) has inefficiencies: the program cannot move forward until all jobs have completed, so you can easily get a situation where 80 cores sit idle waiting for 1 core that’s still working (illustrated in the sketch after this list).
- The stack process hammers the storage to the point of making it unusable for others when we have multiple instances running. The problem appears to be the initial reading of the WCS for each of the input visits (including for CCDs that aren’t contributing to the tract), as GPFS copes poorly with many small reads.
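The second problem, the barrier at the end of each scatter, is generic to any pool that blocks on map. The toy sketch below (plain multiprocessing, not ctrl_pool code, with hypothetical timings) shows one slow task pinning 80 idle workers:

```python
# Toy illustration (plain multiprocessing, not ctrl_pool) of the
# barrier problem: Pool.map does not return until every task has
# finished, so one straggler leaves all other workers idle.
import time
from multiprocessing import Pool


def work(task_id):
    # Make one task in 81 artificially slow (hypothetical timings).
    time.sleep(60.0 if task_id == 0 else 1.0)
    return task_id


if __name__ == "__main__":
    start = time.time()
    with Pool(processes=81) as pool:
        pool.map(work, range(81))  # blocks until the slowest task finishes
    # Elapsed ~60 s, even though 80 of the 81 tasks finished in ~1 s.
    print(f"elapsed: {time.time() - start:.0f} s")
```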
To address these, I think we need to move all the processing into a single process that can run each component, track success or failure, and share information. I think this goes beyond just carving out a block of cores in the cluster for a Condor glidein that we can then operate as our own cluster, because we need more than dependencies expressed at the job-scheduler level. I’ve been thinking of using something DAG-based, and have been looking at Apache Spark, which I think we could use effectively. This would allow us to express the processing DAG in Python code (using PySpark), and I think it would give us a superset of the features in ctrl_pool; a sketch follows below. An alternative would be to continue with discrete jobs, adding dependencies between jobs plus a processing database that tracks state; but I think we need a single process to deal with the inefficiencies of the process pool, and I think that would also be simpler.
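To make the idea concrete, here is a minimal sketch of the reduceFrames scatter/gather expressed in PySpark. The RDD calls (parallelize, map, collect, broadcast, foreach) are real Spark API; the three stand-in functions mirror the hypothetical ones in the earlier sketch and are not actual pipeline entry points.

```python
# Sketch of the reduceFrames DAG expressed in PySpark. The RDD calls
# are real Spark API; the three stand-in functions below are the
# hypothetical placeholders from the sketch earlier in this post.
from pyspark import SparkContext


def process_ccd(ccd_id):
    return {"ccd": ccd_id, "matches": []}  # stand-in

def fit_exposure_astrometry(match_lists):
    return "exposure-wide-wcs"             # stand-in

def write_ccd(result, wcs):
    pass                                   # stand-in


sc = SparkContext(appName="reduceFrames")

# Scatter: one task per CCD; Spark tracks each task and retries failures.
results = sc.parallelize(range(112)).map(process_ccd).collect()

# Gather: fit the exposure-wide solution on the driver.
wcs = fit_exposure_astrometry([r["matches"] for r in results])

# Broadcast the solution to the workers and scatter the writes.
wcs_bc = sc.broadcast(wcs)
sc.parallelize(results).foreach(lambda r: write_ccd(r, wcs_bc.value))
```

The attraction over the process pool is that each map becomes a set of tracked tasks, so per-task success or failure is recorded by the framework rather than buried in batch logs.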
Some questions:
- Does this sound like a reasonable plan? Does it mesh with LSST plans? Might the HSC work produce a prototype that would be useful for LSST (to either use, build upon or learn from)?
- Is Spark a reasonable choice? Are there alternatives we should consider?
- Do any of you with more experience in these technologies than I have any advice?